Open

V2 #14

35 commits
73d745f
WIP: new v2 prototype
justsml Jan 14, 2021
a472e8c
WIP: v2 progress, need to rm extra "types" nesting
justsml Jan 14, 2021
af6f0f8
WIP: v2 prototype, with nested types
justsml Jan 16, 2021
0b03b56
100% coverage, w/ recursive sub-types detection
justsml Jan 16, 2021
d929a57
v2 version update
justsml Jan 16, 2021
dcda8b1
v2-rc1 release candidate
justsml Jan 16, 2021
ebcecc9
update lock file
justsml Jan 16, 2021
88df437
update actions/tests
justsml Jan 16, 2021
6db606b
update actions/tests
justsml Jan 16, 2021
cca34d9
update deps
justsml Jan 16, 2021
882405a
update action target os
justsml Jan 16, 2021
5e87175
cleanup action comments
justsml Jan 16, 2021
6d92221
Merge branch 'master' into v2
justsml Jan 16, 2021
31c5639
renamed api
justsml Jan 20, 2021
705eb37
Merge branch 'v2' of github.com:justsml/schema-detector into v2
justsml Jan 20, 2021
8b76401
addedd generated types
justsml Jan 20, 2021
1455040
pre-typescript conversion
justsml Jan 20, 2021
3c501e2
beginning typescript conversion
justsml Jan 20, 2021
8e0a747
WIP: only 50 type errors remaining
justsml Jan 20, 2021
c52935c
progress
justsml Jan 20, 2021
7f17dce
progress - 12 errors
justsml Jan 20, 2021
64ff4ff
progress - 3 errors
justsml Jan 20, 2021
332f2b7
WIP: almost
justsml Jan 20, 2021
a509f9b
Much progress
justsml Jan 20, 2021
f9c9bd1
WORKING BUILD & TESTS ALL PASS!!!
justsml Jan 20, 2021
995339f
added new result types, tests, helper methods
justsml Jan 21, 2021
3f19c5c
made results more clear
justsml Jan 22, 2021
9c5db45
Adding webpack & new types: discriminating unions
justsml Jan 22, 2021
53a4629
WIP: fixing formatting & splitting some types
justsml Jan 23, 2021
5da44d0
tests not passing, about to fix unique & enum
justsml Jan 23, 2021
03aa5fa
unique check fixed
justsml Jan 23, 2021
d06788c
tests pass 98% coverage
justsml Jan 23, 2021
b373d67
All green! Great success!!!!
justsml Jan 23, 2021
73fe945
limit node version < 14
justsml Jan 23, 2021
242c4a7
build and tests passing
justsml Jan 24, 2021
56 changes: 17 additions & 39 deletions .github/workflows/test-automation.yml
@@ -4,45 +4,23 @@ on: [push]

jobs:
tests:
runs-on: ${{ matrix.os }}
# runs-on: ubuntu-latest

strategy:
matrix:
os: [ubuntu-latest, macos-latest]
# node-version: [12.14.1]
node-version: [12.x, 13.x]
env:
CI: true

steps:
- uses: actions/checkout@v2
- name: Use Node.js ${{ matrix.node-version }}
uses: actions/setup-node@v1
with:
node-version: ${{ matrix.node-version }}
- run: |
npm ci
npm test
runs-on: ubuntu-20.04

coverage:
needs: tests
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Test Code Coverage
uses: actions/setup-node@v1
with:
node-version: 12
- run: |
npm ci
npm test
- uses: codecov/codecov-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }} #required
# file: ./coverage/coverage.xml #optional
flags: unittests #optional
name: schema-analyzer #optional
yml: ./codecov.yml #optional
fail_ci_if_error: true #optional (default = false)

env:
CI: true
- uses: actions/checkout@v2
- name: Use Node.js 14
uses: actions/setup-node@v1
with:
node-version: 14.15.3
- run: |
sudo apt-get update && sudo apt-get install zopfli brotli
npm ci
npm test
- uses: codecov/codecov-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }} #required
flags: unittests #optional
name: schema-analyzer #optional
12 changes: 12 additions & 0 deletions .gitignore
@@ -8,6 +8,11 @@ lerna-debug.log*

.DS_Store

build

.rollup.cache
.cache/

# Diagnostic reports (https://nodejs.org/api/report.html)
report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json

@@ -107,3 +112,10 @@ typings/

# Stores VSCode versions used for testing VSCode extensions
.vscode-test

# yarn v2
.yarn/cache
.yarn/unplugged
.yarn/build-state.yml
.yarn/install-state.gz
.pnp.*
6 changes: 2 additions & 4 deletions .vscode/launch.json
@@ -16,7 +16,7 @@
{
"type": "node",
"request": "launch",
"name": "Launch via NPM",
"name": "npm test:debug",
"runtimeExecutable": "npm",
"runtimeArgs": [
"run-script",
@@ -49,9 +49,7 @@
"name": "Jest Filtered...",
"program": "${workspaceFolder}/node_modules/.bin/jest",
"args": [
"--runInBand",
"--testNamePattern",
"inline csv"
"--runInBand"
],
"console": "integratedTerminal",
"internalConsoleOptions": "neverOpen",
72 changes: 34 additions & 38 deletions README.md
@@ -22,7 +22,8 @@ Schema **Analyzer** is the core library behind Dan's [Schema **Generator**](http
The primary goal is to support any input JSON/CSV and infer as much as possible. More data will generally yield better results.

- [x] Heuristic type analysis for arrays of objects.
- [x] Browser-based (local, no server necessary)
- [x] Nested data structure & multi-table relational output.
- [x] Browser-based (local, no server used)
- [x] Automatic type detection:
- [x] ID - Identifier column, by name and unique Integer check (detects BigInteger)
- [x] ObjectId (MongoDB's 96 bit/12 Byte ID. 32bit timestamp + 24bit MachineID + 16bit ProcessID + 24bit Counter)
@@ -44,47 +45,57 @@ The primary goal is to support any input JSON/CSV and infer as much as possible.
- [x] Quantify # of unique values per column
- [x] Identify `enum` Fields w/ Values
- [x] Identify `Not Null` fields
- [ ] Nested data structure & multi-table relational output.
<!-- - [ ] _Un-de-normalize_ JSON into flat typed objects. -->
- [x] _Normalize_ structured JSON into flat typed objects.

### Getting Started

```js
```bash
npm install schema-analyzer
```

```js
import { schemaBuilder } from 'schema-builder'
```ts
import { schemaAnalyzer } from 'schema-analyzer'

schemaBuilder(schemaName: String, data: Array<Object>): TypeSummary
schemaAnalyzer(schemaName: string, data: any[]): TypeSummary
```

### Preview Analysis Results

> What does this library's analysis look like?

It consists of 3 key top-level properties:
It consists of a few top-level properties:

- `totalRows` - # of rows analyzed.
- `fields: FieldTypeSummary` - a map of field names with all detected types ([includes meta-data](#aggregatesummary) for each type detected, with possible overlaps. e.g. an `Email` is also a `String`, `"42"` is a String and Number)
- `nestedTypes: { [typeAlias: string]: TypeSummary }` - a nested dictionary of sub-types
- `totalRows` - # of rows analyzed.
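The top-level properties listed above can be sketched as TypeScript interfaces. This is a rough sketch inferred from the sample JSON output in this README, not the library's published type definitions: every `...Sketch` name and the `detectedTypeNames` helper are hypothetical, and the sample object copies only the `accountConfirmed` field from the example results.

```ts
// Hypothetical shapes inferred from the README's sample output (not the library's own types).
interface RangeSketch {
  min: number | string;
  mean: number | string;
  max: number | string;
  p25: number | string;
  p33: number | string;
  p50: number | string;
  p66: number | string;
  p75: number | string;
  p99: number | string;
}

interface TypeStatsSketch {
  count: number;
  value?: RangeSketch;  // seen on Number and Date entries in the sample
  length?: RangeSketch; // seen on String and Email entries in the sample
}

interface FieldSketch {
  identity?: boolean;
  types: Record<string, TypeStatsSketch>;
}

interface TypeSummarySketch {
  schemaName?: string;
  totalRows: number;
  fields: Record<string, FieldSketch>;
  nestedTypes: Record<string, TypeSummarySketch>;
}

// One field copied from the README's example results.
const sampleSummary: TypeSummarySketch = {
  schemaName: "sampleUsers",
  totalRows: 5,
  fields: {
    accountConfirmed: {
      types: {
        Unknown: { count: 1 },
        String: {
          count: 1,
          length: { min: 9, mean: 9, max: 9, p25: 9, p33: 9, p50: 9, p66: 9, p75: 9, p99: 9 },
        },
        Boolean: { count: 4 },
      },
    },
  },
  nestedTypes: {},
};

// Lists every type detected for a field; overlapping detections all appear.
function detectedTypeNames(summary: TypeSummarySketch, fieldName: string): string[] {
  const field = summary.fields[fieldName];
  return field ? Object.keys(field.types) : [];
}

const confirmedTypes = detectedTypeNames(sampleSummary, "accountConfirmed");
```

A field can report several types at once, which is why `types` is modeled as a map rather than a single value.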

#### Review the raw results below

Details about each field can be found below.
#### Example Dataset

| id | name | role | email | createdAt | accountConfirmed |
|----|-----------------|-----------|------------------------------|------------|------------------|
| 1 | Eve | poweruser | `[email protected]` | 01/20/2020 | undefined |
| 2 | Alice | user | `[email protected]` | 02/02/2020 | true |
| 3 | Bob | user | `[email protected]` | 12/31/2019 | true |
| 4 | Elliot Alderson | admin | `[email protected]` | 01/01/2001 | false |
| 5 | Sam Sepiol | admin | `[email protected]` | 9/9/99 | true |


#### Analysis Results

```json
{
"schemaName": "sampleUsers",
"totalRows": 5,
"fields": {
"id": {
"identity": true,
"types": {
"Number": {
"rank": 8,
"count": 5,
"value": { "min": 1, "mean": 3, "max": 5, "p25": 2, "p33": 2, "p50": 3, "p66": 4, "p75": 4, "p99": 5 }
},
"String": {
"rank": 12,
"count": 5,
"length": { "min": 1, "mean": 1, "max": 1, "p25": 1, "p33": 1, "p50": 1, "p66": 1, "p75": 1, "p99": 1 }
}
@@ -93,7 +104,6 @@ Details about each field can be found below.
"name": {
"types": {
"String": {
"rank": 12,
"count": 5,
"length": { "min": 3, "mean": 7.2, "max": 15, "p25": 3, "p33": 3, "p50": 5, "p66": 10, "p75": 10, "p99": 15 }
}
@@ -102,7 +112,6 @@ Details about each field can be found below.
"role": {
"types": {
"String": {
"rank": 12,
"count": 5,
"length": { "min": 4, "mean": 5.4, "max": 9, "p25": 4, "p33": 4, "p50": 5, "p66": 5, "p75": 5, "p99": 9 }
}
@@ -111,7 +120,6 @@ Details about each field can be found below.
"email": {
"types": {
"Email": {
"rank": 11,
"count": 5,
"length": { "min": 15, "mean": 19.4, "max": 26, "p25": 15, "p33": 15, "p50": 18, "p66": 23, "p75": 23, "p99": 26 }
}
@@ -120,12 +128,10 @@ Details about each field can be found below.
"createdAt": {
"types": {
"Date": {
"rank": 4,
"count": 4,
"value": { "min": "2001-01-01T00:00:00.000Z", "mean": "2015-04-14T18:00:00.000Z", "max": "2020-02-02T00:00:00.000Z", "p25": "2020-02-02T00:00:00.000Z", "p33": "2020-02-02T00:00:00.000Z", "p50": "2019-12-31T00:00:00.000Z", "p66": "2019-12-31T00:00:00.000Z", "p75": "2001-01-01T00:00:00.000Z", "p99": "2001-01-01T00:00:00.000Z" }
},
"String": {
"rank": 12,
"count": 1,
"length": { "min": 6, "mean": 6, "max": 6, "p25": 6, "p33": 6, "p50": 6, "p66": 6, "p75": 6, "p99": 6 }
}
@@ -134,35 +140,22 @@ Details about each field can be found below.
"accountConfirmed": {
"types": {
"Unknown": {
"rank": -1,
"count": 1
},
"String": {
"rank": 12,
"count": 1,
"length": { "min": 9, "mean": 9, "max": 9, "p25": 9, "p33": 9, "p50": 9, "p66": 9, "p75": 9, "p99": 9 }
},
"Boolean": {
"rank": 3,
"count": 4
}
}
}
}
},
"nestedTypes": {}
}
```

#### Sample input dataset for the example results above

| id | name | role | email | createdAt | accountConfirmed |
|----|-----------------|-----------|------------------------------|------------|------------------|
| 1 | Eve | poweruser | `[email protected]` | 01/20/2020 | undefined |
| 2 | Alice | user | `[email protected]` | 02/02/2020 | true |
| 3 | Bob | user | `[email protected]` | 12/31/2019 | true |
| 4 | Elliot Alderson | admin | `[email protected]` | 01/01/2001 | false |
| 5 | Sam Sepiol | admin | `[email protected]` | 9/9/99 | true |



#### `AggregateSummary`

@@ -175,7 +168,7 @@ Numeric and String types include a summary of the observed field sizes:
- `min` the minimum number or string length
- `max` the maximum number or string length
- `mean` the average number or string length
- `percentiles[25th, 33rd, 50th, 66th, 75th, 99th]` values from the `Nth` percentile number or string length
- `percentiles[25th, 33rd, 50th, 66th, 75th, 99th]` values from the `Nth` percentile (number or string length)

Percentile is based on input data, as-is without sorting.
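The range summary described above can be illustrated with a small sketch. It assumes each percentile is read at index `floor(p * n)` (clamped to the last element). For the string-length figures in the sample output, indexing a sorted copy of the lengths reproduces the published `name` values, while the sample `Date` values line up with the same index applied to the rows in input order. Both observations are inferences from the example data, not a confirmed description of the library; `percentileAt` and `summarizeLengths` are hypothetical names.

```ts
// Hypothetical helper: element at index floor(p * n), clamped to the last index.
function percentileAt<T>(values: readonly T[], p: number): T | undefined {
  if (values.length === 0) return undefined;
  const index = Math.min(values.length - 1, Math.floor(p * values.length));
  return values[index];
}

// Hypothetical summary builder for string lengths (assumes a non-empty input).
function summarizeLengths(lengths: readonly number[]) {
  // Assumption: lengths are sorted ascending before the percentile lookup.
  const sorted = lengths.slice().sort((a, b) => a - b);
  const mean = lengths.reduce((sum, n) => sum + n, 0) / lengths.length;
  return {
    min: sorted[0],
    mean,
    max: sorted[sorted.length - 1],
    p25: percentileAt(sorted, 0.25),
    p33: percentileAt(sorted, 0.33),
    p50: percentileAt(sorted, 0.5),
    p66: percentileAt(sorted, 0.66),
    p75: percentileAt(sorted, 0.75),
    p99: percentileAt(sorted, 0.99),
  };
}

// The `name` column from the example dataset.
const nameLengths = ["Eve", "Alice", "Bob", "Elliot Alderson", "Sam Sepiol"].map((s) => s.length);
const nameSummary = summarizeLengths(nameLengths);
```

Under these assumptions `nameSummary` matches the `name` entry in the sample output (min 3, mean 7.2, max 15).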

@@ -185,7 +178,6 @@ Range data for the `length` of a `String` field type:

```js
{
"rank": 11,
"count": 5,
"length": { "min": 15, "mean": 19.4, "max": 26, "p25": 15, "p33": 15, "p50": 18, "p66": 23, "p75": 23, "p99": 26 }
}
@@ -197,7 +189,6 @@ Range data for a `Date` fields `value`:

```js
{
"rank": 4,
"count": 4,
"value": { "min": "2001-01-01T00:00:00.000Z", "mean": "2015-04-14T18:00:00.000Z", "max": "2020-02-02T00:00:00.000Z", "p25": "2020-02-02T00:00:00.000Z", "p33": "2020-02-02T00:00:00.000Z", "p50": "2019-12-31T00:00:00.000Z", "p66": "2019-12-31T00:00:00.000Z", "p75": "2001-01-01T00:00:00.000Z", "p99": "2001-01-01T00:00:00.000Z" }
}
@@ -211,10 +202,10 @@ We recommend you provide at least 100+ rows. Accuracy increases greatly with 1,0
The following features require a certain minimum # of records:

- Enumeration detection.
- 100+ Rows Required.
- Requires at least 100 rows, with 10 or fewer unique values.
- Number of unique values must not exceed 20 or 5% of the total number of records. (100 records will identify as Enum w/ 5 values. Up to 20 are possible given 400 or 1,000+.)
- `Not Null` detection.
- where rowCount === field count
- where `emptyRowCount < (total rows - threshold)`
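The enumeration rule above can be sketched as a small predicate. This sketch follows the "20 or 5%" wording, read as "whichever limit is smaller", together with the 100-row minimum; it does not model the "10 or fewer unique values" wording, and it is not taken from the library's source. `maxEnumValues` and `looksLikeEnum` are hypothetical names.

```ts
// Hypothetical: largest number of distinct values still treated as an enum.
// Assumes a 100-row minimum and a cap of 20 values or 5% of rows, whichever is smaller.
function maxEnumValues(totalRows: number): number {
  if (totalRows < 100) return 0;
  // 5% of the rows, written as a division by 20 to stay in exact integer arithmetic.
  return Math.min(20, Math.floor(totalRows / 20));
}

// Hypothetical: does a column with `uniqueCount` distinct values qualify as an enum?
function looksLikeEnum(uniqueCount: number, totalRows: number): boolean {
  return uniqueCount > 0 && uniqueCount <= maxEnumValues(totalRows);
}
```

With these assumptions, 100 rows allow up to 5 distinct values and 400 or more rows allow up to 20, which agrees with the parenthetical example in the list above.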

### Full List of Detected Types

@@ -233,3 +224,8 @@ The following features require a certain minimum # of records:
- `Object`
- `Null`

## Similar/Alternative Projects

- https://github.com/quicktype/quicktype
- https://github.com/SweetIQ/schemats
- https://github.com/vojtechhabarta/typescript-generator