MK-SQuIT is a synthetic dataset containing English and SPARQL query pairs. The assembly of question-query pairs requires very little human intervention, sidestepping the tedious and costly process of hand-labeling data. A neural machine translation model can then be trained on such a dataset, allowing lay users to access information-rich knowledge graphs without an understanding of query syntax.
For further details, see our publication: MK-SQuIT: Synthesizing Questions using Iterative Template-filling
This repository contains all the tools needed to synthetically generate question/query pairs (Text2Sparql) from a given knowledge base:
- Generation Pipeline: `mk_squit/generation`
- Example Data: `data`
- Generated Dataset: `out`
- Baseline Model Finetuning and Evaluation: `model`
- Metrics: `mk_squit/metrics`
- Example Entity Resolver: `mk_squit/entity_resolver`
The code in this repository, while easily adaptable, is engineered for a specific knowledge base (Wikidata), query language (SPARQL), natural language (English), and set of question/query types. Data generation can be reproduced with the following steps:
1. Saving Raw Data: By default, the pipeline expects entity data to be saved to files named `*-5k.json` and property data to files named `*-props.json`.
```bash
python -m scripts.gather_wikidata --data-dir data
```
2. Preprocessing: The data must be cleaned and annotated before being fed into the pipeline.
```bash
python -m scripts.preprocess \
    --data-dir ./data \
    --ent-id *-5k.json \
    --prop-id *-props.json \
    --num-examples-to-generate 10
```
Several files are generated, each serving a critical role:

- `*-5k.json` -> `*-5k-preprocessed.json`: Entities typically have a label and alternative labels. All labels are cleaned and grouped into a single list field.
- `*-props.json` -> `*-props-preprocessed.json`: Similar to entities, property labels and alternative labels are grouped. Each label is then converted into a part-of-speech tag (POS-tag) for coherent mapping within a template (e.g. "set in location" -> "VERB-NOUN"). Lastly, a typing field of the format `[domain]->` is added, which must be annotated to include the type: `[domain]->[type]`.
- `pos-examples.txt`: Samples of part-of-speech tags sorted by number of occurrences within the raw data. This is an optional file used to determine which POS-tags are of importance for template generation.
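As a rough illustration of the POS-tag conversion, the sketch below uses a tiny word-lookup table and drops adpositions so that "set in location" yields "VERB-NOUN", matching the example above. Both the lookup table and the tag-collapsing rule are simplifying assumptions; the actual pipeline presumably uses a full part-of-speech tagger.

```python
# Illustrative word -> POS lookup; the real pipeline uses a proper tagger.
POS_LOOKUP = {"set": "VERB", "in": "ADP", "location": "NOUN",
              "created": "VERB", "at": "ADP", "height": "NOUN"}

def label_to_pos(label: str) -> str:
    """Map a property label to a hyphenated POS-tag string."""
    tags = [POS_LOOKUP.get(tok, "NOUN") for tok in label.lower().split()]
    # Drop adpositions so "set in location" yields "VERB-NOUN" (an
    # assumption about how the pipeline collapses tags).
    return "-".join(t for t in tags if t != "ADP")

print(label_to_pos("set in location"))  # VERB-NOUN
```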
3. Annotation of Types: Each `*-props-preprocessed.json` file contains a list of JSON objects, one per property. Each property has a `type` field which must be annotated before proceeding to the next step. For example, `"type": "{domain}->"` must be modified to `"type": "{domain}->{type}"`.
The type specifies the general category the property falls into. Properties "location of" and "location at" could be categorized as "location", whereas "built during" and "created at" could be categorized as "time". To a certain extent, typing is subjective, but it allows the pipeline to string together much more coherent statements. For a list of the types we use, refer to the `WH_LABEL_DICT` within `scripts/generate_type_list.py`, which maps a type to a question prefix.
Modifying `WH_LABEL_DICT` within `scripts/generate_type_list.py` may be necessary if additional types are required. This dictionary maps a type to its question word. For example, asking about a type of "genre" would typically require "what" ("What is the genre of that movie?"), while asking about a "person" would use "who" ("Who is that person?").
While annotating generic types and question prefixes requires a manual element, it substantially improves the coherence of the generated queries.
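Rather than editing each JSON file by hand, the type annotation can be scripted. Below is a minimal sketch; the `TYPE_MAP` contents and the `label`/`type` field names are illustrative assumptions, not the repository's exact schema.

```python
# Hypothetical label -> type mapping; extend with your own annotations.
TYPE_MAP = {"location of": "location", "built during": "time"}

def annotate_types(props: list) -> list:
    """Fill in the trailing {type} of each property's "{domain}->" field."""
    for prop in props:
        if prop["type"].endswith("->") and prop["label"] in TYPE_MAP:
            prop["type"] += TYPE_MAP[prop["label"]]
    return props

props = [{"label": "location of", "type": "{domain}->"}]
print(annotate_types(props))  # [{'label': 'location of', 'type': '{domain}->location'}]
```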
4. Generate Type List: Consolidate property types, start domains, and type metadata into a `type-list-autogenerated.json` file.
```bash
python -m scripts.generate_type_list \
    --data-dir ./data \
    --prop-id *-props-preprocessed.json
```
The data is now ready to be fed into the pipeline. You should have the following files:
- Entity data: `*-5k-preprocessed.json`
- Property data: `*-props-preprocessed.json`
- Part-of-speech examples (optional): `pos-examples.txt`
- Type metadata list: `type-list-autogenerated.json`
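A quick way to verify that these files are all in place before running the generator (an illustrative helper, not part of the repository):

```python
import glob
import os

def check_data_dir(data_dir: str = "data") -> bool:
    """Return True if every expected preprocessing output exists in data_dir."""
    patterns = [
        "*-5k-preprocessed.json",
        "*-props-preprocessed.json",
        "type-list-autogenerated.json",
    ]
    # glob.glob returns an empty list (falsy) when a pattern has no match.
    return all(glob.glob(os.path.join(data_dir, p)) for p in patterns)
```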
5. Generating the Questions and Queries:
Generating datasets with the code in its current form is very simple:
```bash
python -m mk_squit.generation.full_query_generator \
    --data-dir data \
    --prop-id *-props-preprocessed.json \
    --ent-id *-5k-preprocessed.json \
    --out-dir out
```
This code will generate a 100k training set and a 5k easy test set in `out`.
The code synthetically generates questions and queries using multiple layers of question/query templating. First, a baseline question template is generated from a Context-Free Grammar (CFG). Second, the baseline template is numbered according to the order of the predicates/arguments in the logical form of the template (the numbering functions are responsible for figuring this out). Third, the numbered template is ontologically typed so that when predicates and arguments (a.k.a. entities and properties) are sampled, their types do not conflict. Lastly, the predicates and arguments are sampled and inserted into the question template and into a SPARQL query template based on the numbered order of the items in the numbered and typed template. This process is explained more thoroughly in the paper.
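The first stage can be sketched as a small recursive CFG expansion. The grammar below is a hypothetical miniature for illustration only, not the grammar shipped in `template_generator.py`:

```python
import random

# Toy grammar: nonterminals map to lists of productions; any symbol not in
# the table is treated as a terminal token.
GRAMMAR = {
    "Q": [["WH", "is", "the", "PROP", "of", "ENT", "?"]],
    "PROP": [["<prop>"], ["<prop>", "of", "the", "<prop>"]],
}

def expand(symbol: str) -> list:
    """Recursively expand a symbol into a flat list of terminal tokens."""
    if symbol not in GRAMMAR:
        return [symbol]
    production = random.choice(GRAMMAR[symbol])
    return [tok for part in production for tok in expand(part)]

print(" ".join(expand("Q")))  # e.g. "WH is the <prop> of ENT ?"
```

The later numbering and typing stages then annotate the `<prop>` slots so that sampled entities and properties are type-compatible.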
6. Generating the Test Hard Dataset:
The Test Hard dataset is a variation of the Test Easy dataset with deeper and fuzzier baseline template productions and an exclusive `chemical` domain. To generate this dataset, some code needs to be (un)commented. Look through `generation/full_query_generator.py`, `generation/template_filler.py`, and `generation/template_generator.py` and uncomment sections of code annotated with a `TEST_HARD` comment. You may need to comment out some other code after uncommenting, e.g. in `generation/template_generator.py` you should comment out lines 86-88 and uncomment lines 89-91. After you have made the (un)comments, simply run the same generation command from step 5:
```bash
python -m mk_squit.generation.full_query_generator
```
All data generated by the generator is written to files formatted like this:

| english | sparql | unique hash |
| --- | --- | --- |
| What is the height of Getica's creator? | SELECT ?end WHERE { [ Getica ] wdt:P50 / wdt:P2048 ?end . } | 0ea54cd5187baf7239c8f2023ae33bb3001c5a49 |
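Downstream code can consume these files with a simple reader. This sketch assumes pipe-delimited rows of (english, sparql, unique hash); adjust the delimiter if the actual output format differs:

```python
import csv

def read_pairs(path: str):
    """Yield (english, sparql, unique_hash) tuples from a generated file.

    Assumes pipe-delimited rows, matching the table layout shown above.
    """
    with open(path, newline="") as fh:
        for row in csv.reader(fh, delimiter="|"):
            if len(row) < 3:
                continue  # skip header/separator lines
            english, sparql, uid = (col.strip() for col in row[:3])
            yield english, sparql, uid
```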
Each stage can be modified:
**Difficulty:** Low-Mid
Entities and predicates of the triplestore database are necessary, with each entity having an entity type, ID, label, and any label aliases, and each predicate having a predicate type, ID, label, and any label aliases. Using a typed database is critical, as the rules used to generate queries leverage semantic knowledge of entities and predicates. To this end, some manual annotation of predicate types is required.
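Concretely, the records might look like the following. The field names and alias values are illustrative assumptions based on the description above, not the repository's exact schema (`Q41421` and `P27` are Wikidata-style IDs used for flavor):

```python
# Hypothetical entity and predicate records with the fields the pipeline needs.
entity = {
    "id": "Q41421",                 # Wikidata-style entity ID
    "label": "Michael Jordan",
    "aliases": ["MJ", "His Airness"],
    "type": "person",
}
predicate = {
    "id": "P27",                    # Wikidata-style property ID
    "label": "country of citizenship",
    "aliases": ["nationality"],
    "type": "{domain}->location",   # annotated as in step 3
}
print(entity["type"], predicate["type"])
```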
**Difficulty:** Low
The exact syntax of the generated queries can be modified in `mk_squit/generation/template_filler.py`. The `construct_query_pair()` function and the `fill_*_ent_query()` functions would need modification to accommodate changes in syntax.
The code in this repository is designed to generate queries in SPARQL using some syntactic sugar:
```sparql
SELECT ?end WHERE { [ Marcelo Bielsa ] wdt:P3448 / wdt:P31 ?end . }
```
Note that the entity labels have not been converted into their IDs. We leave this problem of entity resolution to downstream processes in the translation pipeline.
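A minimal sketch of such a downstream resolver, replacing bracketed labels with IDs via a lookup table. The regex, the `wd:` prefixing, and the placeholder ID are assumptions; the repository ships its own example under `mk_squit/entity_resolver`:

```python
import re

# Illustrative label -> ID table; Q12345 is a placeholder, not a real QID.
LABEL_TO_ID = {"Marcelo Bielsa": "Q12345"}

def resolve_entities(query: str) -> str:
    """Replace "[ Label ]" spans with "wd:<ID>" when the label is known."""
    def repl(match: re.Match) -> str:
        label = match.group(1)
        return "wd:" + LABEL_TO_ID[label] if label in LABEL_TO_ID else match.group(0)
    return re.sub(r"\[ (.+?) \]", repl, query)

print(resolve_entities("SELECT ?end WHERE { [ Marcelo Bielsa ] wdt:P3448 / wdt:P31 ?end . }"))
# SELECT ?end WHERE { wd:Q12345 wdt:P3448 / wdt:P31 ?end . }
```

Unknown labels are left untouched, so partial resolution failures remain visible in the output query.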
**Difficulty:** High
This work leverages the natural predicate-argument structure of (English) language to generate corresponding questions and queries. This method may generalize to other languages with a similar predicate-argument structure, but would be difficult to generalize to languages where that structure is less syntactically constrained. Almost all relevant code is in `mk_squit/generation/template_generator.py`.
Consider generating questions in English and then translating them over to another natural language using an off-the-shelf machine translation model.
**Difficulty:** Low-Mid
Question types are defined by context-free grammars (CFGs) located in `mk_squit/generation/template_generator.py`. The `type_template()` and `number_*_ent()` functions may need to be modified to accommodate novel semantic constructions.
Query types, which correspond to question types, are defined by the `fill_*_query()` functions in `mk_squit/generation/template_filler.py`. Novel semantic constructions may also need to be accommodated in `construct_query_pair()`.
The code in this repository implements three question types:
- `single_entity`: What was the nationality of Michael Jordan?
- `multi_entity`: Is Michael Jordan the friend of Roger Rabbit?
- `count`: How many sons does Michael Jordan have?
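For orientation, the three types pair up with question and query shapes roughly like this. The templates below are simplified illustrations written for this summary, not the exact templates in `template_filler.py`:

```python
# Simplified question/query template pairs for each implemented type.
# Placeholders like <ent> and <prop-path> are illustrative.
QUESTION_TYPES = {
    "single_entity": (
        "What was the <prop> of <ent>?",
        "SELECT ?end WHERE { [ <ent> ] <prop-path> ?end . }",
    ),
    "multi_entity": (
        "Is <ent1> the <prop> of <ent2>?",
        "ASK { [ <ent2> ] <prop-path> [ <ent1> ] . }",
    ),
    "count": (
        "How many <prop> does <ent> have?",
        "SELECT (COUNT(?end) AS ?count) WHERE { [ <ent> ] <prop-path> ?end . }",
    ),
}
print(sorted(QUESTION_TYPES))
```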
```bibtex
@article{mk-squit,
  title   = {MK-SQuIT: Synthesizing Questions using Iterative Template-filling},
  author  = {Benjamin A. Spiegel and Vincent Cheong and James E. Kaplan and Anthony Sanchez},
  journal = {arXiv preprint arXiv:2011.02566},
  year    = {2020},
}
```