This page is a step-by-step introduction to writing an omniparser schema (specifically tailored to the latest "omni.2.1" schema version) and to ingesting and transforming inputs programmatically and via the CLI tool.
- Golang 1.16 installed.
- github.com/jf-tech/omniparser cloned (assuming the clone location is ~/dev/jf-tech/omniparser/).
- The guide assumes a macOS or Linux dev environment. If you're on Windows, you'll need to make minor adjustments to the bash commands posted below.
We'll use a simple CSV input for this guide. We have separate pages that go into details about other input formats.
Consider the following CSV input (simplified and changed from the full example here):
DATE|HIGH TEMP C|LOW TEMP F|WIND DIR|WIND SPEED KMH
01/31/2019 12:34:56-0800| 10.5| 30.2| N| 33
07/31/2020 01:23:45-0500| 39| 95| SE| 8
Bear in mind that this example is somewhat contrived (such as the mixed use of celsius and fahrenheit) so that we can illustrate the capabilities of an omniparser schema.
Omniparser ingests and transforms an input (stream) into a number of output records in JSON. For CSV, the basic unit of ingestion and transformation is a single data line. For this example, we want to transform each data line into the following JSON output:
[
{
"date": "2019-01-31T12:34:56-08:00",
"high_temperature_fahrenheit": 50.9,
"low_temperature_fahrenheit": 30.2,
"wind": "North 20.5 mph"
},
{
"date": "2020-07-31T01:23:45-05:00",
"high_temperature_fahrenheit": 102.2,
"low_temperature_fahrenheit": 95,
"wind": "South East 4.97 mph"
}
]
As you can see, in the desired output, we'd like to standardize all the input temperatures to fahrenheit; we'd also like to do some translation so that the wind direction and wind speed are "massaged" into a more human-readable string; lastly, we'd like to normalize the date text into the RFC 3339 standard format.
Before we get into schema writing, let's first get familiar with omniparser CLI so that we can easily and incrementally test our schema writing.
Assuming you have the git repo cloned at ~/dev/jf-tech/omniparser/, simply run this bash script:
~/dev/jf-tech/omniparser/cli.sh help
For this guide, we will need to use the transform command:
~/dev/jf-tech/omniparser/cli.sh transform --help
Now assuming your temporary working directory for this guide is ~/Downloads/omniparser/guide/, let's create two files there, one for the input and one for the schema:
$ cd ~/Downloads/omniparser/guide/
$ touch input.csv
$ touch schema.json
Use any editor to cut & paste the CSV content from The Input into input.csv, and now run the omniparser CLI from ~/Downloads/omniparser/guide/:
$ ~/dev/jf-tech/omniparser/cli.sh transform -i input.csv -s schema.json
Error: unable to perform schema validation: EOF
This is expected, given schema.json is still empty.
(Note, -i input.csv is optional; you can use a standard IO pipe to feed the input into cli.sh, e.g. cat input.csv | cli.sh transform -s schema.json.)
Now we're ready to go!
This is the common part of all omniparser schemas, the header parser_settings:
{
"parser_settings": {
"version": "omni.2.1",
"file_format_type": "csv2"
}
}
(Note, "csv2" has replaced the deprecated "csv" schema; see more details in CSV Schema in Depth.)
It's self-explanatory. Now let's run the CLI again:
$ ~/dev/jf-tech/omniparser/cli.sh transform -i input.csv -s schema.json
Error: schema 'schema.json' validation failed: (root): transform_declarations is required
A schema tells the parser how to do the transform, and that part is missing from the current schema. Let's add it...
transform_declarations is a section in the schema containing specific instructions that tell the parser how to do the transformation. Let's add an empty transform_declarations for now:
{
"parser_settings": {
"version": "omni.2.1",
"file_format_type": "csv2"
},
"transform_declarations": {}
}
Run the CLI and we get another error:
$ ~/dev/jf-tech/omniparser/cli.sh transform -i input.csv -s schema.json
Error: schema 'schema.json' validation failed: transform_declarations: FINAL_OUTPUT is required
Let's add an empty FINAL_OUTPUT in:
{
"parser_settings": {
"version": "omni.2.1",
"file_format_type": "csv2"
},
"transform_declarations": {
"FINAL_OUTPUT": {}
}
}
FINAL_OUTPUT is the special name reserved for the transform template that will be used for the output. Given the section is called transform_declarations, you might have guessed we can have multiple templates defined in it. Each template can reference other templates. There must be one and only one template called FINAL_OUTPUT.
Run the CLI and we get a new error:
$ ~/dev/jf-tech/omniparser/cli.sh transform -i input.csv -s schema.json
Error: schema 'schema.json' validation failed: (root): file_declaration is required
Ah... seems like a new section, file_declaration, is needed...
While transform_declarations contains instructions to the parser on how to transform the ingested data into the desired output format, we still owe the parser instructions on how to ingest the input stream; that's where file_declaration comes in. (Note that not all input formats require a file_declaration section; e.g., JSON and XML inputs need no file_declaration in their schemas.)
For CSV, we need to define the following settings:
- What's the delimiter character, comma or something else?
- Is there a header row in the CSV input that defines the names of each column? If yes, do we want to check and verify all the column names, or can we ignore the header row; if no, what each column should be called during the data row ingestion and transformation?
For this guide's example, the settings are:
- The delimiter is |.
- Yes, there is a header line with all the column names defined, and we'd like to verify them.
- The actual data lines start at line 2. (Line numbers are 1-based.)
Let's add these:
{
"parser_settings": {
"version": "omni.2.1",
"file_format_type": "csv2"
},
"file_declaration": {
"delimiter": "|",
"records": [
{
"header": "^DATE|HIGH TEMP C|LOW TEMP F|WIND DIR|WIND SPEED KMH$",
"min": 1, "max": 1
},
{
"columns": [
{ "name": "DATE" },
{ "name": "HIGH TEMP C" },
{ "name": "LOW TEMP F" },
{ "name": "WIND DIR" },
{ "name": "WIND SPEED KMH" }
]
}
]
},
"transform_declarations": {
"FINAL_OUTPUT": {}
}
}
The records part needs a bit of explanation:
- It says there are two different types of records in the file: the first is the header record, and the second is the data record.
- For the header record, the header property contains a regexp pattern to match the actual header row in the input file. This is how we match and make sure the header row is exactly what we expect. Note the header record also has two properties, min and max; together they specify that there must be one and only one header row. Note min, if missing, defaults to 0, and max, if missing, defaults to -1, which means unlimited.
- For the data record, it defines the columns with which we extract data from each line of the input file. Note the data record doesn't explicitly define min and max, thus we expect to see 0 or more (unlimited) data rows in the input file.
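As an aside, the header value is a regular expression (omniparser is written in Go, so standard Go regexp semantics are a reasonable assumption here; treat this as a sketch). Note that an unescaped | in a regexp is the alternation metacharacter, so the pattern above is looser than it appears: it matches any line that starts with DATE, among other alternatives. It still distinguishes our header row from the data rows:

```go
package main

import (
	"fmt"
	"regexp"
)

// The header pattern from the schema above. The unescaped `|` characters
// act as regexp alternation, so this matches any line starting with "DATE",
// or containing one of the other column-name alternatives.
var headerRe = regexp.MustCompile(`^DATE|HIGH TEMP C|LOW TEMP F|WIND DIR|WIND SPEED KMH$`)

func main() {
	fmt.Println(headerRe.MatchString("DATE|HIGH TEMP C|LOW TEMP F|WIND DIR|WIND SPEED KMH")) // true
	fmt.Println(headerRe.MatchString("01/31/2019 12:34:56-0800| 10.5| 30.2| N| 33"))         // false
}
```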
Run the CLI again:
$ ~/dev/jf-tech/omniparser/cli.sh transform -i input.csv -s schema.json
[
null
]
Better, at least no more error messages, but null? That's no good.
It turns out we need to tell omniparser which of the two records is the one we want to ingest and transform. Kinda makes sense, doesn't it? :)
To do so, simply put "is_target": true on the data record:
{
"parser_settings": {
"version": "omni.2.1",
"file_format_type": "csv2"
},
"file_declaration": {
"delimiter": "|",
"records": [
{
"header": "^DATE|HIGH TEMP C|LOW TEMP F|WIND DIR|WIND SPEED KMH$",
"min": 1, "max": 1
},
{
"is_target": true,
"columns": [
{ "name": "DATE" },
{ "name": "HIGH TEMP C" },
{ "name": "LOW TEMP F" },
{ "name": "WIND DIR" },
{ "name": "WIND SPEED KMH" }
]
}
]
},
"transform_declarations": {
"FINAL_OUTPUT": {}
}
}
Run the CLI, and we get this:
$ ~/dev/jf-tech/omniparser/cli.sh transform -i input.csv -s schema.json
[
"01/31/2019 12:34:56-0800 10.5 30.2 N 33",
"07/31/2020 01:23:45-0500 39 95 SE 8"
]
Wow, we're getting something! Although not exactly matching what we want in The Desired Output, it's a big step forward!
It seems the ingestion and transform directly copy each line from input.csv and output it as a JSON text string in the final output JSON array.
To understand what's behind the scenes and build a foundation for understanding the inner workings of the parser, we need to deviate from schema writing for a moment...
Short answer: IDR.
IDR is an in-memory data representation format used by omniparser for the ingested data from all input formats. If you're interested in more technical details, check the IDR doc mentioned above.
CSV has a very simple IDR representation: each data line is mapped to an IDR tree, where each column is mapped to the tree's leaf nodes. So for our sample input csv here, the first data line would be represented by the following IDR:
|
+--"DATE"
| +--"01/31/2019 12:34:56-0800"
|
+--"HIGH TEMP C"
| +--" 10.5"
|
+--"LOW TEMP F"
| +--" 30.2"
|
+--"WIND DIR"
| +--" N"
|
+--"WIND SPEED KMH"
+--" 33"
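To make the tree shape concrete, here is a toy Go sketch (these are illustrative types, not omniparser's actual IDR API) modeling a CSV data line as a two-level tree: one node per column, each holding a single text leaf:

```go
package main

import "fmt"

// node is a toy stand-in for an IDR tree node: a name plus either
// child nodes or leaf text data. (Not omniparser's real types.)
type node struct {
	name string
	data string
	kids []*node
}

// rowToTree maps one CSV data line to a tree: the root has one child
// per column, and each column node has a single text leaf.
func rowToTree(names, values []string) *node {
	root := &node{}
	for i, n := range names {
		col := &node{name: n, kids: []*node{{data: values[i]}}}
		root.kids = append(root.kids, col)
	}
	return root
}

// text returns the leaf text under the named column node.
func (n *node) text(name string) string {
	for _, k := range n.kids {
		if k.name == name {
			return k.kids[0].data
		}
	}
	return ""
}

func main() {
	row := rowToTree(
		[]string{"DATE", "HIGH TEMP C", "LOW TEMP F", "WIND DIR", "WIND SPEED KMH"},
		[]string{"01/31/2019 12:34:56-0800", " 10.5", " 30.2", " N", " 33"},
	)
	fmt.Println(row.text("DATE"))
}
```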
You can imagine converting the IDR into XML, which helps in understanding the extensive use of XPath queries later in transformation:
<>
<DATE>01/31/2019 12:34:56-0800</DATE>
<HIGH TEMP C> 10.5</HIGH TEMP C>
<LOW TEMP F> 30.2</LOW TEMP F>
<WIND DIR> N</WIND DIR>
<WIND SPEED KMH> 33</WIND SPEED KMH>
</>
Note XML/XPath don't like element names containing spaces. While the IDR doesn't care about names with spaces, XPath queries used in transforms do care and will break. So we'd like to assign some XPath-friendly column names in our schema (replacing spaces with '_').
Let's make small modifications to our schema:
{
"parser_settings": {
"version": "omni.2.1",
"file_format_type": "csv2"
},
"file_declaration": {
"delimiter": "|",
"records": [
{
"header": "^DATE|HIGH TEMP C|LOW TEMP F|WIND DIR|WIND SPEED KMH$",
"min": 1, "max": 1
},
{
"is_target": true,
"columns": [
{ "name": "DATE" },
{ "name": "HIGH_TEMP_C" },
{ "name": "LOW_TEMP_F" },
{ "name": "WIND_DIR" },
{ "name": "WIND_SPEED_KMH" }
]
}
]
},
"transform_declarations": {
"FINAL_OUTPUT": {}
}
}
Rerun the CLI to ensure everything is still working. Now the IDR and its imaginary converted XML equivalent look like this:
<>
<DATE>01/31/2019 12:34:56-0800</DATE>
<HIGH_TEMP_C> 10.5</HIGH_TEMP_C>
<LOW_TEMP_F> 30.2</LOW_TEMP_F>
<WIND_DIR> N</WIND_DIR>
<WIND_SPEED_KMH> 33</WIND_SPEED_KMH>
</>
Remember this, and we'll move onto some real transformation!
Recall that we want to convert a data line such as
01/31/2019 12:34:56-0800| 10.5| 30.2| N| 33
into a JSON object output such as
{
"date": "2019-01-31T12:34:56-08:00",
"high_temperature_fahrenheit": 50.9,
"low_temperature_fahrenheit": 30.2,
"wind": "North 20.5 mph"
}
So let's define the object skeleton in the FINAL_OUTPUT in schema.json:
"transform_declarations": {
"FINAL_OUTPUT": { "object": {
"date": { "xpath": "DATE" },
"high_temperature_fahrenheit": { "xpath": "HIGH_TEMP_C" },
"low_temperature_fahrenheit": { "xpath": "LOW_TEMP_F" },
"wind": { "xpath": "WIND_DIR" }
}}
}
(Note for brevity, parser_settings and file_declaration are omitted above.)
We first define FINAL_OUTPUT as an object, which contains a number of fields, such as date, high_temperature_fahrenheit, etc. For each field, we simply copy the data text as its value from an XPath query run on the IDR.
Remember for the first data line, its corresponding IDR (or the IDR's equivalent XML) looks like:
<>
<DATE>01/31/2019 12:34:56-0800</DATE>
<HIGH_TEMP_C> 10.5</HIGH_TEMP_C>
<LOW_TEMP_F> 30.2</LOW_TEMP_F>
<WIND_DIR> N</WIND_DIR>
<WIND_SPEED_KMH> 33</WIND_SPEED_KMH>
</>
Thus, the XPath query "xpath": "DATE" on the root of the IDR returns 01/31/2019 12:34:56-0800, which is used as the value for the field date. And so on for all the other fields.
Run the CLI, we have:
$ ~/dev/jf-tech/omniparser/cli.sh transform -i input.csv -s schema.json
[
{
"date": "01/31/2019 12:34:56-0800",
"high_temperature_fahrenheit": "10.5",
"low_temperature_fahrenheit": "30.2",
"wind": "N"
},
{
"date": "07/31/2020 01:23:45-0500",
"high_temperature_fahrenheit": "39",
"low_temperature_fahrenheit": "95",
"wind": "SE"
}
]
Much better! The result, at least structurally, looks like our desired output.
One small thing you might have noticed is that any leading and trailing whitespace is stripped during the transformation. (Don't worry, there is a way to preserve leading/trailing whitespace if you're wondering.)
Yes, the result looks better, but it's still far from ideal. Let's fix the issues one by one.
Recall we want the date field in the desired output to be RFC 3339 compliant. We can use a built-in parser function to achieve this:
"transform_declarations": {
"FINAL_OUTPUT": { "object": {
"date": { "custom_func": {
"name": "dateTimeToRFC3339",
"args": [
{ "xpath": "DATE" },
{ "const": "", "_comment": "input timezone" },
{ "const": "", "_comment": "output timezone" }
]
}},
"high_temperature_fahrenheit": { "xpath": "HIGH_TEMP_C" },
"low_temperature_fahrenheit": { "xpath": "LOW_TEMP_F" },
"wind": { "xpath": "WIND_DIR" }
}}
}
Run the CLI and we have:
$ ~/dev/jf-tech/omniparser/cli.sh transform -i input.csv -s schema.json
[
{
"date": "2019-01-31T12:34:56-08:00",
"high_temperature_fahrenheit": "10.5",
"low_temperature_fahrenheit": "30.2",
"wind": "N"
},
{
"date": "2020-07-31T01:23:45-05:00",
"high_temperature_fahrenheit": "39",
"low_temperature_fahrenheit": "95",
"wind": "SE"
}
]
Yes!!
So basically we changed the simple "date": { "xpath": "DATE" } directive into a function call:
"date": { "custom_func": {
"name": "dateTimeToRFC3339",
"args": [
{ "xpath": "DATE" },
{ "const": "", "_comment": "input timezone" },
{ "const": "", "_comment": "output timezone" }
]
}},
These built-in functions are called custom_func (and yes, programmatic users of omniparser have the ability to add additional functions). To invoke a custom_func, you need to provide the name of the function, in this case dateTimeToRFC3339, and a list of arguments the function requires.
(You can find the full reference for all built-in custom_funcs here.)
The first argument here, { "xpath": "DATE" }, provides the function with the input datetime string. The second argument, which dateTimeToRFC3339 requires, specifies what time zone the input datetime string is in. Since the datetime strings in the guide's sample CSV already contain time zone offsets (-0800, -0500), an empty string is supplied for the input time zone argument. The third argument is the desired output time zone. If, say, we wanted to standardize all the date fields in the output to the America/Los_Angeles time zone, we could specify it in the third argument, and the custom_func would perform the correct time zone shifts for us.
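For intuition, the same conversion can be sketched with Go's standard time package (this is just an illustration of the behavior, not dateTimeToRFC3339's actual implementation):

```go
package main

import (
	"fmt"
	"time"
)

// toRFC3339 parses the guide's input datetime layout (MM/DD/YYYY
// hh:mm:ss±hhmm) and reformats it as RFC 3339. Since the input already
// carries a numeric offset, no extra time zone hint is needed.
func toRFC3339(s string) (string, error) {
	t, err := time.Parse("01/02/2006 15:04:05-0700", s)
	if err != nil {
		return "", err
	}
	return t.Format(time.RFC3339), nil
}

func main() {
	out, err := toRFC3339("01/31/2019 12:34:56-0800")
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // 2019-01-31T12:34:56-08:00
}
```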
Note the HIGH_TEMP_C column in the input data is in celsius, while the desired output calls for fahrenheit. Let's fix that:
"transform_declarations": {
"FINAL_OUTPUT": { "object": {
"date": { "custom_func": {
"name": "dateTimeToRFC3339",
"args": [
{ "xpath": "DATE" },
{ "const": "", "_comment": "input timezone" },
{ "const": "", "_comment": "output timezone" }
]
}},
"high_temperature_fahrenheit": { "xpath": "HIGH_TEMP_C", "template": "template_c_to_f" },
"low_temperature_fahrenheit": { "xpath": "LOW_TEMP_F" },
"wind": { "xpath": "WIND_DIR" }
}},
"template_c_to_f": {
"custom_func": {
"name": "javascript",
"args": [
{ "const": "Math.floor((temp_c * 9 / 5 + 32) * 10) / 10" },
{ "const": "temp_c" }, { "xpath": ".", "type": "float" }
]
}
}
}
Here we introduce two new things: 1) templates and 2) the javascript custom_func.
- Template

  A template is for schema snippet reuse, or sometimes simply for readability. Imagine we have a schema in which we need to transform multiple celsius values into fahrenheit; instead of writing out the math again and again, we can write a template (template_c_to_f) that can be reused.

  In our case, we replaced the simple direct-copying directive
  "high_temperature_fahrenheit": { "xpath": "HIGH_TEMP_C" },
  with the template directive
  "high_temperature_fahrenheit": { "xpath": "HIGH_TEMP_C", "template": "template_c_to_f" },

  Note we still have the "xpath": "HIGH_TEMP_C" clause before "template": "template_c_to_f". What it says is to run/apply the template against the IDR sub-tree anchored (or rooted) on the node <HIGH_TEMP_C>.
. -
custom_func
javascript
Now let's take a closer look at the template
template_c_to_f
:"template_c_to_f": { "custom_func": { "name": "javascript", "args": [ { "const": "Math.floor((temp_c * 9 / 5 + 32) * 10) / 10" }, { "const": "temp_c" }, { "xpath": ".", "type": "float" } ] } }
custom_func
javascript
takes a number of arguments: the first one is the actual script string, and all remaining arguments are to provide values for all the variables declared in the script string, in this particular case, only one variabletemp_c
. All remaining arguments come in pairs. The first in each pair always declares what variable the second in pair is about. And the second in each pair provides the actual value for the variable. In this example, we see variabletemp_c
should have a value based on the XPath query"."
and converted intofloat
type. Remember this template's invocation is anchored on the IDR node<HIGH_TEMP_C>
, thus XPath query"."
returns its text value"10.5"
, after which it was converted into numeric value10.5
before the math computation starts.Type conversion should be used only when needed. When we convert text
"10.5"
into numeric float value10.5
,"type": "float"
is used. However when the script is done, the result is already in float, there is no need to specify"type": "float"
for thecustom_func
directive.
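In plain Go, the template's work (the float conversion of the anchored node's text plus the script's math) amounts to something like this sketch:

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// cToF mirrors the javascript in template_c_to_f: convert celsius to
// fahrenheit, flooring to one decimal place.
func cToF(c float64) float64 {
	return math.Floor((c*9/5+32)*10) / 10
}

func main() {
	// The "type": "float" step: the IDR leaf holds text (with leading
	// whitespace, which the transform strips), not a number.
	c, err := strconv.ParseFloat(strings.TrimSpace(" 10.5"), 64)
	if err != nil {
		panic(err)
	}
	fmt.Println(cToF(c)) // 50.9
}
```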
Now let's run the CLI:
$ ~/dev/jf-tech/omniparser/cli.sh transform -i input.csv -s schema.json
[
{
"date": "2019-01-31T12:34:56-08:00",
"high_temperature_fahrenheit": 50.9,
"low_temperature_fahrenheit": "30.2",
"wind": "N"
},
{
"date": "2020-07-31T01:23:45-05:00",
"high_temperature_fahrenheit": 102.2,
"low_temperature_fahrenheit": "95",
"wind": "SE"
}
]
Great! high_temperature_fahrenheit
looks nice now. But low_temperature_fahrenheit
still needs a
minor fix.
Observing the output above, you can see the low_temperature_fahrenheit output value is a string, not a numeric value. That should be an easy fix:
"transform_declarations": {
"FINAL_OUTPUT": { "object": {
"date": { "custom_func": {
"name": "dateTimeToRFC3339",
"args": [
{ "xpath": "DATE" },
{ "const": "", "_comment": "input timezone" },
{ "const": "", "_comment": "output timezone" }
]
}},
"high_temperature_fahrenheit": { "xpath": "HIGH_TEMP_C", "template": "template_c_to_f" },
"low_temperature_fahrenheit": { "xpath": "LOW_TEMP_F", "type": "float" },
"wind": { "xpath": "WIND_DIR" }
}},
"template_c_to_f": {
"custom_func": {
"name": "javascript",
"args": [
{ "const": "Math.floor((temp_c * 9 / 5 + 32) * 10) / 10" },
{ "const": "temp_c" }, { "xpath": ".", "type": "float" }
]
}
}
}
Basically we changed "low_temperature_fahrenheit": { "xpath": "LOW_TEMP_F" } to "low_temperature_fahrenheit": { "xpath": "LOW_TEMP_F", "type": "float" }.
Run the CLI again and we have:
$ ~/dev/jf-tech/omniparser/cli.sh transform -i input.csv -s schema.json
[
{
"date": "2019-01-31T12:34:56-08:00",
"high_temperature_fahrenheit": 50.9,
"low_temperature_fahrenheit": 30.2,
"wind": "N"
},
{
"date": "2020-07-31T01:23:45-05:00",
"high_temperature_fahrenheit": 102.2,
"low_temperature_fahrenheit": 95,
"wind": "SE"
}
]
Almost there! The wind field is a bit tricky to fix...
Recall from The Desired Output that we want the field wind to contain a human-readable wind stat:
{
...
"wind": "North 20.5 mph"
},
Recall the first data line's IDR (XML equivalent) looks like:
<>
...
<WIND_DIR> N</WIND_DIR>
<WIND_SPEED_KMH> 33</WIND_SPEED_KMH>
</>
So the wind value needs to be derived from two columns in the input CSV data line. Let's look at them one by one.
- Wind Direction

  In the input, the wind direction is abbreviated (such as "N", "E", "SW", etc.). In the desired output we want it in plain English, so we need some mapping, for which we again resort to the almighty custom function javascript:

  "wind_acronym_mapping": {
    "custom_func": {
      "name": "javascript",
      "args": [
        { "const": "dir=='N'?'North':dir=='NE'?'North East':dir=='E'?'East':dir=='SE'?'South East':dir=='S'?'South':dir=='SW'?'South West':dir=='W'?'West':dir=='NW'?'North West':'Tornado'"},
        { "const": "dir" }, { "xpath": "." }
      ]
    }
  }

  A long, ternary-operator (? :) infested javascript line maps the wind direction abbreviations into English phrases.
Wind Speed
In the input, the column
WIND_SPEED_KMH
unit is kilometers per hour (kmh) while the desired output calls for miles per hour (mph). Let's do the conversion:Math.floor(kmh * 0.621371 * 100) / 100
(Several uses of
Math.floor(...*100/100)
throughout this page is to limit the number of decimal places to be more human readable.)
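The two pieces above can be sketched together in plain Go (again, just an illustration of what the javascript snippets compute, not omniparser internals):

```go
package main

import (
	"fmt"
	"math"
)

// windDirNames mirrors the ternary chain in wind_acronym_mapping.
var windDirNames = map[string]string{
	"N": "North", "NE": "North East", "E": "East", "SE": "South East",
	"S": "South", "SW": "South West", "W": "West", "NW": "North West",
}

// windString builds the human-readable wind stat: spelled-out direction
// plus speed converted from kmh to mph, floored to two decimal places.
func windString(dir string, kmh float64) string {
	name, ok := windDirNames[dir]
	if !ok {
		name = "Tornado" // the mapping's catch-all default
	}
	mph := math.Floor(kmh*0.621371*100) / 100
	return fmt.Sprintf("%v %v mph", name, mph)
}

func main() {
	fmt.Println(windString("N", 33)) // North 20.5 mph
	fmt.Println(windString("SE", 8)) // South East 4.97 mph
}
```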
Putting 1) and 2) together, the new transform schema looks like this:
"transform_declarations": {
"FINAL_OUTPUT": { "object": {
"date": { "custom_func": {
"name": "dateTimeToRFC3339",
"args": [
{ "xpath": "DATE" },
{ "const": "", "_comment": "input timezone" },
{ "const": "", "_comment": "output timezone" }
]
}},
"high_temperature_fahrenheit": { "xpath": "HIGH_TEMP_C", "template": "template_c_to_f" },
"low_temperature_fahrenheit": { "xpath": "LOW_TEMP_F", "type": "float" },
"wind": { "custom_func": {
"name": "javascript",
"args": [
{ "const": "win_dir + ' ' + Math.floor(kmh * 0.621371 * 100) / 100 + ' mph'" },
{ "const": "win_dir" }, { "xpath": "WIND_DIR", "template": "wind_acronym_mapping" },
{ "const": "kmh" }, { "xpath": "WIND_SPEED_KMH", "type": "float" }
]
}}
}},
"template_c_to_f": {
"custom_func": {
"name": "javascript",
"args": [
{ "const": "Math.floor((temp_c * 9 / 5 + 32) * 10) / 10" },
{ "const": "temp_c" }, { "xpath": ".", "type": "float" }
]
}
},
"wind_acronym_mapping": {
"custom_func": {
"name": "javascript",
"args": [
{ "const": "dir=='N'?'North':dir=='NE'?'North East':dir=='E'?'East':dir=='SE'?'South East':dir=='S'?'South':dir=='SW'?'South West':dir=='W'?'West':dir=='NW'?'North West':'Tornado'"},
{ "const": "dir" }, { "xpath": "." }
]
}
}
}
Run the CLI one last time, and we have:
$ ~/dev/jf-tech/omniparser/cli.sh transform -i input.csv -s schema.json
[
{
"date": "2019-01-31T12:34:56-08:00",
"high_temperature_fahrenheit": 50.9,
"low_temperature_fahrenheit": 30.2,
"wind": "North 20.5 mph"
},
{
"date": "2020-07-31T01:23:45-05:00",
"high_temperature_fahrenheit": 102.2,
"low_temperature_fahrenheit": 95,
"wind": "South East 4.97 mph"
}
]
A perfect match for The Desired Output!
While it's strongly recommended to use the CLI for schema development, the eventual goal is to use omniparser and schemas programmatically to enable high-speed / high-volume processing. Below is a code snippet showing how to achieve this:
package main

import (
	"fmt"
	"io"
	"strings"

	"github.com/jf-tech/omniparser"
	"github.com/jf-tech/omniparser/transformctx"
)

func main() {
	schema, err := omniparser.NewSchema("your schema name", strings.NewReader("your schema content"))
	if err != nil { panic(err) }
	transform, err := schema.NewTransform("your input name", strings.NewReader("your input content"), &transformctx.Ctx{})
	if err != nil { panic(err) }
	for {
		output, err := transform.Read()
		if err == io.EOF {
			break
		}
		if err != nil { panic(err) }
		// output contains a []byte of the ingested and transformed record.
		fmt.Println(string(output))
		// Also transform.RawRecord() gives you access to the raw record.
		fmt.Println(transform.RawRecord().Checksum())
	}
}
For reference, here are the complete input, schema, code, and output from this guide.

DATE|HIGH TEMP C|LOW TEMP F|WIND DIR|WIND SPEED KMH
01/31/2019 12:34:56-0800| 10.5| 30.2| N| 33
07/31/2020 01:23:45-0500| 39| 95| SE| 8
{
"parser_settings": {
"version": "omni.2.1",
"file_format_type": "csv2"
},
"file_declaration": {
"delimiter": "|",
"records": [
{
"header": "^DATE|HIGH TEMP C|LOW TEMP F|WIND DIR|WIND SPEED KMH$",
"min": 1, "max": 1
},
{
"is_target": true,
"columns": [
{ "name": "DATE" },
{ "name": "HIGH_TEMP_C" },
{ "name": "LOW_TEMP_F" },
{ "name": "WIND_DIR" },
{ "name": "WIND_SPEED_KMH" }
]
}
]
},
"transform_declarations": {
"FINAL_OUTPUT": { "object": {
"date": { "custom_func": {
"name": "dateTimeToRFC3339",
"args": [
{ "xpath": "DATE" },
{ "const": "", "_comment": "input timezone" },
{ "const": "", "_comment": "output timezone" }
]
}},
"high_temperature_fahrenheit": { "xpath": "HIGH_TEMP_C", "template": "template_c_to_f" },
"low_temperature_fahrenheit": { "xpath": "LOW_TEMP_F", "type": "float" },
"wind": { "custom_func": {
"name": "javascript",
"args": [
{ "const": "win_dir + ' ' + Math.floor(kmh * 0.621371 * 100) / 100 + ' mph'" },
{ "const": "win_dir" }, { "xpath": "WIND_DIR", "template": "wind_acronym_mapping" },
{ "const": "kmh" }, { "xpath": "WIND_SPEED_KMH", "type": "float" }
]
}}
}},
"template_c_to_f": {
"custom_func": {
"name": "javascript",
"args": [
{ "const": "Math.floor((temp_c * 9 / 5 + 32) * 10) / 10" },
{ "const": "temp_c" }, { "xpath": ".", "type": "float" }
]
}
},
"wind_acronym_mapping": {
"custom_func": {
"name": "javascript",
"args": [
{ "const": "dir=='N'?'North':dir=='NE'?'North East':dir=='E'?'East':dir=='SE'?'South East':dir=='S'?'South':dir=='SW'?'South West':dir=='W'?'West':dir=='NW'?'North West':'Tornado'"},
{ "const": "dir" }, { "xpath": "." }
]
}
}
}
}
schema, err := omniparser.NewSchema("your schema name", strings.NewReader("your schema content"))
if err != nil { ... }
transform, err := schema.NewTransform("your input name", strings.NewReader("your input content"), &transformctx.Ctx{})
if err != nil { ... }
for {
output, err := transform.Read()
if err == io.EOF {
break
}
if err != nil { ... }
// output contains a []byte of the ingested and transformed record.
}
[
{
"date": "2019-01-31T12:34:56-08:00",
"high_temperature_fahrenheit": 50.9,
"low_temperature_fahrenheit": 30.2,
"wind": "North 20.5 mph"
},
{
"date": "2020-07-31T01:23:45-05:00",
"high_temperature_fahrenheit": 102.2,
"low_temperature_fahrenheit": 95,
"wind": "South East 4.97 mph"
}
]