jsonvectorizer Learning nested JSON

Currently, when I run the vectorizer, the only schema returned is the one for "root", even though I have over 100,000 documents with many nested attributes.

In vectorizers, there are the following files:

    basevectorizer.py
    boolvectorizer.py
    numbervectorizer.py
    stringvectorizer.py
    timestampvectorizer.py

Is there something I'm not quite understanding when it comes to "learning" JSON nested deeper than 'root'?

Apr 13 '19 19:04 FullPint

The code should automatically learn the schema of nested documents. I just fixed a bug in the sample code that might have caused the issue. Use vectorizer.extend(docs) to learn the schema, where docs is a list of JSON documents, or use vectorizer.extend([doc]) to learn the schema incrementally, one document at a time.
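
To make the two calling patterns concrete, here is a minimal, library-independent sketch of what learning a nested schema from a list of documents could look like. SchemaLearner and everything inside it are hypothetical and are not jsonvectorizer's implementation; only the convention of passing a list (docs, or [doc] for a single document) is taken from the reply above.

```python
class SchemaLearner:
    """Hypothetical stand-in for illustration; NOT jsonvectorizer's code."""

    def __init__(self):
        # Maps a path tuple such as ("root", "a", "b") to the set of
        # JSON type names observed at that path.
        self.schema = {}

    def extend(self, docs):
        """Update the schema from a list of parsed JSON documents."""
        if isinstance(docs, dict):
            # Guard against passing a single document instead of a list.
            raise TypeError("pass a list of documents, e.g. extend([doc])")
        for doc in docs:
            self._visit(doc, ("root",))

    def _visit(self, value, path):
        # Record the type at this path, then recurse into children.
        self.schema.setdefault(path, set()).add(_type_name(value))
        if isinstance(value, dict):
            for key, child in value.items():
                self._visit(child, path + (key,))
        elif isinstance(value, list):
            for item in value:
                self._visit(item, path + ("[]",))


def _type_name(value):
    # bool is checked before numbers because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    raise TypeError(f"not a JSON value: {value!r}")


docs = [{"a": {"b": 1}}, {"a": {"c": "x"}, "d": True}]

# Learn from the whole list at once.
batch = SchemaLearner()
batch.extend(docs)

# Learn incrementally, wrapping each document in a one-element list.
incremental = SchemaLearner()
for doc in docs:
    incremental.extend([doc])
```

In this sketch both approaches end up with the same schema, including the nested paths under "a". The real library may store and merge schemas differently.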

Apr 17 '19 13:04 arsarabi

Hello arsarabi,

Thank you for making your code available.

I've also had no luck learning nested attributes. Do I need to define a vectorizer of type "object" to be able to learn nested JSON objects?

Suppose I have a set of documents that match the following schema:

{
  "nestedobject": {
    "stringattr1": "some string",
    "numberattr1": 42,
    "stringattr2": "another string"
  },
  "stringattr3": "a third string",
  "booleanattr1": true
}

...do I need to define additional vectorizers beyond those you provide in the sample code?

If I (only) use the vectorizers provided in the sample code, the only learned features are:

0: root has "booleanattr1"
1: root has "stringattr3"
2: root has "nestedobject"
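
For comparison, here is a small standard-library sketch (it does not use jsonvectorizer, and list_paths is a hypothetical helper) that enumerates every attribute path in the example document above. These are the nested paths one might hope to see reflected in the learned features; the dotted naming is only an assumption for illustration.

```python
import json

# The example document from this post.
SAMPLE = json.loads("""
{
  "nestedobject": {
    "stringattr1": "some string",
    "numberattr1": 42,
    "stringattr2": "another string"
  },
  "stringattr3": "a third string",
  "booleanattr1": true
}
""")


def list_paths(value, prefix="root"):
    """Return dotted paths for a value and all of its nested attributes."""
    paths = [prefix]
    if isinstance(value, dict):
        for key, child in value.items():
            paths.extend(list_paths(child, f"{prefix}.{key}"))
    return paths


paths = list_paths(SAMPLE)
for path in paths:
    print(path)
```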

Thank you in advance for answering this (very basic) usage question :).

Apr 26 '22 00:04 jvmk

Hello,

It has been a while since I worked on this, but I believe it should work with nested JSON out of the box if you follow the usage steps. Could you provide sample code that recreates the issue? Thanks!

May 14 '22 17:05 arsarabi