
Add JSON extraction embedded in HTML script element

Open hkcomori opened this issue 1 year ago • 6 comments

I want to extract JSON embedded in HTML script elements so that it can be processed with a JSON dotted path. To do that, I have added a format that outputs only the bare content. "Barejson" is a term I coined because a plain format name could not explain the behavior. If you have a better idea, I would like to adopt it.

This format can output only one item, so an error occurs if more or fewer than one item is found.
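The "exactly one item" rule can be sketched roughly as follows. This is an illustrative Python sketch, not the PHP code from this PR; the function name `bare_content` and the item layout (a dict with a `content` key) are assumptions made for the example.

```python
def bare_content(items):
    """Return the content of the single item.

    Hypothetical helper: raises ValueError when the number of items
    is anything other than exactly one (zero, or two and more).
    """
    if len(items) != 1:
        raise ValueError(f"expected exactly one item, found {len(items)}")
    return items[0]["content"]


# One item: its content is returned as-is, with no feed wrapper around it.
single = bare_content([{"content": '{"key": "value"}'}])
```

Failing loudly on zero or multiple items avoids silently picking one of several candidates when a selector matches more than intended.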

This is triggered by the following discussion: https://github.com/FreshRSS/FreshRSS/discussions/6406

hkcomori avatar May 14 '24 01:05 hkcomori

Sorry, I don't understand the use case here.

Maybe show an example of usage?

dvikan avatar May 14 '24 02:05 dvikan

I want to use a JSON dotted path to get information from JSON that is embedded in a script element, such as the following on this page. It contains information on the articles that should go into the RSS feed. Some of it can be read from the HTML, but some information is only in the JSON.

<script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{"workId":"018d6a5c-b9f2-77db-9191-e7cc6fbfdce2", ... }</script>

The JSON must be separated from the HTML because a JSON dotted path cannot read HTML. For this purpose, the feature in this PR extracts the JSON, and the JSON dotted path then processes the result.
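The two steps (extract the JSON from the script element, then walk it with a dotted path) can be sketched like this. It is a minimal Python illustration using only the standard library, not the implementation in RSS-Bridge or FreshRSS. The names `ScriptJsonExtractor`, `extract_embedded_json`, and `dotted_get` are hypothetical, the dotted-path lookup is a simplified stand-in that handles object keys only, and the `workId` value in the sample HTML is a placeholder rather than real page data.

```python
import json
from html.parser import HTMLParser


class ScriptJsonExtractor(HTMLParser):
    """Collect the text of the <script> element with a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_embedded_json(html, script_id):
    """Step 1: pull the script body out of the HTML and parse it as JSON."""
    parser = ScriptJsonExtractor(script_id)
    parser.feed(html)
    parser.close()
    return json.loads("".join(parser.chunks))


def dotted_get(data, path):
    """Step 2: follow a dotted path such as 'a.b.c' through nested dicts."""
    for key in path.split("."):
        data = data[key]
    return data


# Placeholder page: the structure mirrors the snippet above, the value is made up.
SAMPLE_HTML = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props":{"pageProps":{"workId":"example-id"}}}'
    "</script></body></html>"
)

data = extract_embedded_json(SAMPLE_HTML, "__NEXT_DATA__")
work_id = dotted_get(data, "props.pageProps.workId")
```

Keeping the two steps separate mirrors the split described above: one component only knows HTML, the other only knows JSON.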

hkcomori avatar May 14 '24 05:05 hkcomori

XPathBridge example:

Enter web page URL: https://comic-walker.com/detail/KC_003160_S?episodeType=latest
Item selector: //script[@id="__NEXT_DATA__"]
Item title selector: "JSON"
Item description selector: ./text()
Use raw item description: true

hkcomori avatar May 14 '24 05:05 hkcomori

Would it be better to create bridges that extract RSS from embedded JSON, instead of adding a format like this for intermediate files?

hkcomori avatar May 15 '24 02:05 hkcomori

  1. are you aware that there already exists a JsonFormat?

  2. Have you tested this PR and it does what you need?

dvikan avatar May 15 '24 14:05 dvikan

  1. are you aware that there already exists a JsonFormat?

Of course, I first tried JsonFormat. I expected the following results:

{
    ...
    "content": {
        "key": "value"
    }
}

But in fact, the content was converted to a string, so the raw JSON could not be extracted:

{
    ...
    "content": "{\"key\": \"value\"}"
}

  2. Have you tested this PR and it does what you need?

Yes. I confirmed that the result is the raw JSON content and that a JSON dotted path can process it.
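To illustrate why the stringified output is a problem, here is a small Python sketch comparing the two shapes shown above. The `dotted_get` and `lookup` helpers are hypothetical and only approximate a dotted-path lookup; they are not the FreshRSS implementation.

```python
import json


def dotted_get(data, path):
    """Follow a dotted path such as 'content.key' through nested dicts."""
    for key in path.split("."):
        data = data[key]
    return data


def lookup(doc, path):
    """Return the value at path, or None when the path cannot be followed."""
    try:
        return dotted_get(doc, path)
    except (KeyError, TypeError):
        return None


# Expected shape: content is a nested JSON object.
nested = {"content": {"key": "value"}}

# Shape observed with JsonFormat: content is a JSON-encoded string.
stringified = {"content": json.dumps({"key": "value"})}

from_nested = lookup(nested, "content.key")
from_string = lookup(stringified, "content.key")
```

With the nested shape the path resolves to the value. With the stringified shape the lookup stops at `content`, because a string has no `key` member; a consumer would need an extra `json.loads` step that a dotted path alone does not provide.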

hkcomori avatar May 15 '24 15:05 hkcomori