Adding support for header mappings in form submit.

Open jeremicmilan opened this issue 2 years ago • 13 comments

Header mappings let you configure which headers are forwarded from the form-submit request to the main page scrape that produces the sensor data. A common use case is populating the X-Login-Token header with the token returned by the login.
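To illustrate the mechanism outside of Home Assistant, here is a minimal Python sketch. It is not multiscrape's actual code: the function name `build_forwarded_headers`, the `HeaderMappings` type, and the sample token `abc123` are all made up for illustration, and plain Python callables stand in for the Jinja `value_template`.

```python
import json
from typing import Callable, Dict

# Hypothetical stand-in for header_mappings: header name -> function that
# extracts the header value from the raw form-submit response body.
HeaderMappings = Dict[str, Callable[[str], str]]


def build_forwarded_headers(login_body: str, mappings: HeaderMappings) -> Dict[str, str]:
    """Evaluate every mapping against the form-submit response body."""
    return {name: extract(login_body) for name, extract in mappings.items()}


# Simulated login response; a real one would come from the form-submit request.
login_body = json.dumps({"loginToken": "abc123"})

mappings: HeaderMappings = {
    # Plays the role of value_template '{{ (value | from_json).loginToken }}'.
    "X-Login-Token": lambda body: json.loads(body)["loginToken"],
}

# These headers would then be sent with the main scrape request.
headers = build_forwarded_headers(login_body, mappings)
print(headers)
```

The point of the sketch is only the data flow: the login response is parsed once, and the extracted value travels as a header on the follow-up request.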

Example:

multiscrape:
  - name: AirVisual
    resource: 'https://website-api.airvisual.com/v1/users/<user_id>/devices/<device_id>?units.system=metric&AQI=US&language=en'
    scan_interval: 10
    log_response: true
    form_submit:
      submit_once: True
      resource: 'https://website-api.airvisual.com/v1/auth/signin/by/email'
      input:
        email: '<email>'
        password: '<password>'
      header_mappings:
        - name: X-Login-Token
          value_template: '{{ (value | from_json).loginToken }}'
    sensor:
      - name: AirVisual Outdoor AQI
        value_template: '{{ (value | from_json).current.aqi.value }}'
        unit_of_measurement: 'AQI US'
      - name: AirVisual Outdoor PM1 AQI
        value_template: '{{ (value | from_json).current.pm1.aqi }}'
        unit_of_measurement: 'AQI US'
      - name: AirVisual Outdoor PM2.5 AQI
        value_template: '{{ (value | from_json).current.pm25.aqi }}'
        unit_of_measurement: 'AQI US'
      - name: AirVisual Outdoor PM10 AQI
        value_template: '{{ (value | from_json).current.pm10.aqi }}'
        unit_of_measurement: 'AQI US'

      - name: AirVisual Outdoor PM1
        value_template: '{{ (value | from_json).current.pm1.value }}'
        unit_of_measurement: 'µg/m³'
      - name: AirVisual Outdoor PM2.5
        value_template: '{{ (value | from_json).current.pm25.value }}'
        unit_of_measurement: 'µg/m³'
      - name: AirVisual Outdoor PM10
        value_template: '{{ (value | from_json).current.pm10.value }}'
        unit_of_measurement: 'µg/m³'
      - name: AirVisual Outdoor ParticleCount
        value_template: '{{ (value | from_json).current.pc.value }}'
        unit_of_measurement: 'pc/L'

      - name: AirVisual Outdoor Pressure
        value_template: '{{ (value | from_json).current.pressure.value }}'
        unit_of_measurement: 'mbar'
      - name: AirVisual Outdoor Humidity
        value_template: '{{ (value | from_json).current.humidity.value }}'
        unit_of_measurement: '%'
      - name: AirVisual Outdoor Temperature
        value_template: '{{ (value | from_json).current.temperature.value }}'
        unit_of_measurement: '°C'

Log into https://dashboard.iqair.com/personal/devices and select the device to get the <device_id> from the URL. Then analyze the network traffic and find the request whose name starts with <device_id>. It contains the entire path used in the example, including <user_id> (there's probably an easier way to get <user_id>, but this works).

jeremicmilan avatar Feb 03 '24 19:02 jeremicmilan

Thank you for this extensive contribution! I really like your solution and it seems to solve many open issues! I'm currently working on another custom component of mine (https://github.com/danieldotnl/ha-measureit). I hope that will be finished soon and then I'll give multiscrape some more love and attention again. I have a very large change prepared in the test-service branch which I hope to merge soon. It introduces two services for scraping and retrieving page content, which should make it a lot easier for people to figure out their configuration. However, merging is not trivial, so I would like to ask whether you can rebase your PR on that branch.

Thanks again and let me know if you have questions!

danieldotnl avatar Feb 09 '24 09:02 danieldotnl

Thank you for making the integration in the first place! You're welcome, I'm glad to contribute.

No worries, I can wait. The scraping works for me, it's just that I have to refresh it from time to time.

jeremicmilan avatar Feb 09 '24 14:02 jeremicmilan

@danieldotnl, what is the rough estimate? Is it days/weeks/months? If it's months, I'll probably go ahead and install the integration from my branch to avoid expiring tokens.

jeremicmilan avatar Feb 25 '24 10:02 jeremicmilan

Weeks!

danieldotnl avatar Feb 26 '24 17:02 danieldotnl

Finally! I merged the test-service branch! Could you please resolve the conflicts in your PR? 😅

danieldotnl avatar Mar 19 '24 16:03 danieldotnl

Thanks! Little bit short on time at the moment. I'll try it this weekend or the next one.

jeremicmilan avatar Mar 20 '24 12:03 jeremicmilan

Any progress? I don't want your tokens to expire ;-)

danieldotnl avatar Apr 05 '24 20:04 danieldotnl

Sorry, totally forgot about this and actually remembered it two days ago. I'll do it now/soon.

jeremicmilan avatar Apr 06 '24 04:04 jeremicmilan

I messed up and did not create a dev branch on my fork (first PR on GitHub and kind of thought that my fork is considered as a dev branch). There might have been a way to do this, but it is what it is. 😄 I'll try to reactivate this PR, and if not possible, I'll create a new one.

jeremicmilan avatar Apr 06 '24 07:04 jeremicmilan

@danieldotnl , merge complete. Please review.

jeremicmilan avatar Apr 06 '24 09:04 jeremicmilan

Thanks a lot! That wasn't the easiest merge 😊 I'll look into it soon. One question already: how can I test it, as I don't have an AirVisual account?

danieldotnl avatar Apr 06 '24 10:04 danieldotnl

Yep, it wasn't an easy merge. On the other hand, it was not that complex, only tedious (making sure something is not lost in the transition).

Regarding testing, you should be able to use it with any website operating with username and pass only (no two factor and more complicated encryption). On the other hand, you should be able to create an AirVisual account. You just will not be able to scrape data from a device, but you can change that to scrape a random thing from a slightly different URL. Here is an example:

multiscrape:
  - name: AirVisual
    resource: 'https://website-api.airvisual.com/v1/users/<user_id>/devices'
    scan_interval: 10
    log_response: true
    form_submit:
      submit_once: True
      resource: 'https://website-api.airvisual.com/v1/auth/signin/by/email'
      input:
        email: '<username>'
        password: '<password>'
      header_mappings:
        - name: X-Login-Token
          value_template: '{{ (value | from_json).loginToken }}'
    sensor:
      - name: test_header_mapping
        value_template: '{{ 10 }}'

Get user_id with the instructions from the PR description. If you remove header_mappings from the config (or input wrong password), you should get the 401 error (as headers were not populated in the main scrape).
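The expected 401 can be reasoned about with a small simulation. This is only a sketch under assumptions: `fake_main_endpoint`, `main_scrape_status`, and the token value are hypothetical, and the real AirVisual API may behave differently in detail.

```python
from typing import Dict, Optional

VALID_TOKEN = "abc123"  # hypothetical token issued by a successful login


def fake_main_endpoint(headers: Dict[str, str]) -> int:
    """Return an HTTP status code the way a token-protected API might."""
    if headers.get("X-Login-Token") == VALID_TOKEN:
        return 200
    return 401


def main_scrape_status(login_token: Optional[str], use_header_mappings: bool) -> int:
    """Simulate the main scrape; login_token is None when the login failed."""
    headers: Dict[str, str] = {}
    if use_header_mappings and login_token is not None:
        headers["X-Login-Token"] = login_token
    return fake_main_endpoint(headers)


print(main_scrape_status("abc123", use_header_mappings=True))   # token forwarded
print(main_scrape_status("abc123", use_header_mappings=False))  # mapping removed
print(main_scrape_status(None, use_header_mappings=True))       # wrong password
```

Only the first call carries a valid token, so only that one would be accepted by the simulated endpoint.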

To have better testing of the feature, maybe we can ask folks that filed issues linked in this PR (that this feature should resolve) to test it?

jeremicmilan avatar Apr 06 '24 13:04 jeremicmilan

> Yep, it wasn't an easy merge. On the other hand, it was not that complex, only tedious (making sure something is not lost in the transition).

Nice job!

> Regarding testing, you should be able to use it with any website operating with username and pass only (no two factor and more complicated encryption).

But most sites with a username/password don't require this, right? I can check whether it forwards headers, but I'd still like to verify that the login works with this change where it didn't work without it.

> On the other hand, you should be able to create an AirVisual account. You just will not be able to scrape data from a device, but you can change that to scrape a random thing from a slightly different URL. Here is an example:
>
> multiscrape:
>   - name: AirVisual
>     resource: 'https://website-api.airvisual.com/v1/users/<user_id>/devices'
>     scan_interval: 10
>     log_response: true
>     form_submit:
>       submit_once: True
>       resource: 'https://website-api.airvisual.com/v1/auth/signin/by/email'
>       input:
>         email: '<username>'
>         password: '<password>'
>       header_mappings:
>         - name: X-Login-Token
>           value_template: '{{ (value | from_json).loginToken }}'
>     sensor:
>       - name: test_header_mapping
>         value_template: '{{ 10 }}'
>
> Get user_id with the instructions from the PR description. If you remove header_mappings from the config (or input wrong password), you should get the 401 error (as headers were not populated in the main scrape).

I made a throwaway account, but I don't see how I can get the userid since I don't have a device to add... Ideas?

> To have better testing of the feature, maybe we can ask folks that filed issues linked in this PR (that this feature should resolve) to test it?

Yes, I'll create a pre-release and we'll ask the people who created token related issues to try it out.

danieldotnl avatar Apr 08 '24 19:04 danieldotnl

> But most sites with a username/pass don't require this right? I can check if it forwards headers, but I still like to verify if the login works with this change while it didn't work without.

I thought that most of them actually do give you a token that you use for accessing the website. Otherwise, how does the website know that a specific request is properly authenticated? Without a token, spoofing becomes much easier. On the other hand, you do not want to transmit the username/password with every request, as that opens you up to many other vulnerabilities. I'm more familiar with bearer tokens (with the Authorization header), but X-Login-Token seems to serve the same purpose. (Copilot's summary was attached as an image.)

> I made a throwaway account, but I don't see how I can get the userid since I don't have a device to add... Ideas?

Go to https://dashboard.iqair.com/personal/devices and, in the Network tab of the Developer Tools, search for account?units.system=metric&AQI=US&language=en. The first id field in the response is the userid.

jeremicmilan avatar Apr 10 '24 05:04 jeremicmilan

@danieldotnl, any progress on the review? Can I help you somehow?

jeremicmilan avatar Apr 26 '24 17:04 jeremicmilan

Thanks @jeremicmilan, I know it's taking a long time! I had it working a week ago with AirVisual, so I now understand how it works. Now I need to find time for the review. I'm going on a short vacation now, and I really hope to be able to finish soon after.

danieldotnl avatar Apr 26 '24 18:04 danieldotnl

I have been thinking about this and wanted to share my thoughts with you.

I think this feature should be about two-step requests where we need to pass something back to the server that we received as response to the first request. This could be:

  • a token in a header
  • a token on a response page (your case)
  • a cookie

I was wondering if we could solve all three of them, because I believe the other two cases are actually more common than yours. What about introducing a concept of variables which you can use in templates (e.g. in headers/resource url) later on?

Maybe your example would then become something like this:

multiscrape:
  - name: AirVisual
    resource: 'https://website-api.airvisual.com/v1/users/<user_id>/devices/<device_id>?units.system=metric&AQI=US&language=en'
    headers:
      X-Login-Token: '{{ token }}'
    scan_interval: 10
    log_response: true
    form_submit:
      submit_once: True
      resource: 'https://website-api.airvisual.com/v1/auth/signin/by/email'
      input:
        email: '<email>'
        password: '<password>'
      variables:
        - name: token
          value_template: '{{ (value | from_json).loginToken }}'
    sensor:

Later on the other cases could then be implemented.
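As a rough illustration of the variables idea, here is a Python sketch that extracts a value once and then substitutes it into header templates. It is an assumption-laden toy, not a design for the integration: `render`, the regex-based `{{ name }}` substitution (standing in for real Jinja templates), and the sample token are all hypothetical.

```python
import json
import re
from typing import Dict

# Matches placeholders such as '{{ token }}'; a toy substitute for Jinja.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, variables: Dict[str, str]) -> str:
    """Replace each '{{ name }}' placeholder with the matching variable."""
    return _PLACEHOLDER.sub(lambda m: variables[m.group(1)], template)


# Step 1: the form submit returns a body; variables are extracted from it.
login_body = json.dumps({"loginToken": "abc123"})
variables = {"token": json.loads(login_body)["loginToken"]}

# Step 2: the variables are available when rendering headers (or the URL).
header_templates = {"X-Login-Token": "{{ token }}"}
rendered_headers = {
    name: render(template, variables) for name, template in header_templates.items()
}
print(rendered_headers)
```

In this toy version an unknown variable name raises a `KeyError`; how a real implementation should handle that case is an open design question.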

Let me know what you think!

danieldotnl avatar May 06 '24 19:05 danieldotnl

I like the idea. Also, it should not be too hard to implement. Maybe we should be more specific when using variables to avoid naming conflicts? Something like X-Login-Token: {{ form_submit_variables.token }}?
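A small Python sketch of why a namespace helps. Everything here is hypothetical (the `resolve` and `render` helpers, the context layout, the sample values), and a regex again stands in for real Jinja templates. The namespaced lookup and a same-named top-level value can coexist without shadowing each other.

```python
import re
from typing import Any, Dict

# Matches placeholders with optional dotted paths, e.g. '{{ a.b }}'.
_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def resolve(path: str, context: Dict[str, Any]) -> Any:
    """Walk a dotted path such as 'form_submit_variables.token'."""
    value: Any = context
    for part in path.split("."):
        value = value[part]
    return value


def render(template: str, context: Dict[str, Any]) -> str:
    """Replace each placeholder with the value found at its dotted path."""
    return _PLACEHOLDER.sub(lambda m: str(resolve(m.group(1), context)), template)


context = {
    # Some unrelated value that happens to use the same short name.
    "token": "unrelated-global-value",
    # Variables produced by the form submit live under their own namespace.
    "form_submit_variables": {"token": "abc123"},
}

namespaced = render("{{ form_submit_variables.token }}", context)
plain = render("{{ token }}", context)
print(namespaced, plain)
```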

I only wish you had replied a couple of days ago, as I had some extra time during the holiday season. Now it's going to take a while before I find a slot to work on this, but I'll get it done. :)

jeremicmilan avatar May 07 '24 10:05 jeremicmilan

Hello. I have the same problem: I need to do a GET request on a page to obtain the bearer token. However, I cannot use the token in a header as a template.

- name: Preco do gas
  resource: https://www.precodogas.com.br/fazer-pedido-ads/3/-23.536139/-46.6777853/Rua%20Apiac%C3%A1s/Pompeia/S%C3%A3o%20Paulo/SP/467
  scan_interval: 600
  sensor:
    - unique_id: preco_do_gas_token
      name: Preco do gas Token
      select: "#token"
      attribute: value

- name: Preco do gas API
  resource: https://api-lb.precodogas.com.br/api/lista-revendas/-23.536139/-46.6777853/
  scan_interval: 600
  headers:
    APP-KEY: 14c2529eb4498c5d1ffd6915d05bf58a91bdda796af59f41d480d11c099d0479
    Authorization: bearer {{ states("sensor.preco_do_gas_token") }}
    Content-type: "application/json"
    Accept: "text/plain"
    Content-Encoding: "utf-8"
  method: POST
  payload: '{"tipo_produto": "0","ordem": "1","rua": "Rua Apiacás","bairro": "Pompeia","cidade": "São Paulo","estado": "SP","numero": "467","origem": "2","cep": "","idrevenda": "0"}'
  log_response: true

In the log the header looks like this:

{'APP-KEY': '14c2529eb4498c5d1ffd6915d05bf58a91bdda796af59f41d480d11c099d0479', 'Authorization': 'bearer unavailable', 'Content-type': 'application/json', 'Accept': 'text/plain', 'Content-Encoding': 'utf-8'}

Is there any way to do this?

BentoAlves avatar May 08 '24 01:05 BentoAlves

@BentoAlves, once this PR is completed, you'll be able to do something like this. Per the current feature design, you will fetch the Bearer token as part of the form submit part of the configuration and then use it in the parsing itself. @danieldotnl and I are currently discussing how exactly this functionality should be exposed.

@danieldotnl, this is another interesting way to solve the problem. However, I don't like that the bearer token leaks into the rest of the system (where nothing else needs it), so I would still do it by forwarding variables from the form submit part. That said, there are probably legitimate use cases for using global Home Assistant state while scraping something. Is that possible today, and is @BentoAlves just not doing it correctly at the moment?

jeremicmilan avatar May 08 '24 04:05 jeremicmilan

@jeremicmilan That's very good.

@danieldotnl Being able to get a Token from a state or an input helper is also very useful. This makes it easier to change a token or even the username and password of an integration, without the need to change a file and reload the integration.

@jeremicmilan I'm definitely not doing it. This was a way of expressing my need.

BentoAlves avatar May 08 '24 17:05 BentoAlves

@jeremicmilan I think our proposal with the variables will solve @BentoAlves's case. I don't see a convincing use case to "scrape" headers into sensors. Headers are not meant to carry state-like data.

> @danieldotnl Being able to get a Token from a state or an input helper is also very useful. This makes it easier to change a token or even the username and password of an integration, without the need to change a file and reload the integration.

It's not a problem to get a token from a sensor or input helper. Pulling it into the sensor is the part that isn't supported.

danieldotnl avatar May 11 '24 11:05 danieldotnl

@jeremicmilan I already implemented handling cookies as I needed it. I don't think they need to be part of the variables as I don't see a reason why they shouldn't be included in requests all the time, just like browsers do. If you are interested, you can take a look and/or review: #368
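For readers unfamiliar with the browser-like behavior mentioned here, a minimal Python sketch using only the standard library follows. It does not reflect how #368 is implemented; the cookie names and values are invented, and only `http.cookies.SimpleCookie` is a real API.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header values from a first (login) response.
set_cookie_headers = [
    "sessionid=xyz789; Path=/; HttpOnly",
    "theme=dark; Path=/",
]

# Collect the cookies, ignoring attributes such as Path and HttpOnly.
jar = SimpleCookie()
for header_value in set_cookie_headers:
    jar.load(header_value)

# Build the Cookie header that would accompany every later request.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
print(cookie_header)
```

A real client would also honor domain, path, and expiry rules before sending a cookie back; the sketch skips all of that.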

danieldotnl avatar May 11 '24 11:05 danieldotnl

@danieldotnl, implemented the feedback as part of this second PR: #374. Let's continue the discussion there.

I have to abandon this PR due to my earlier mistake of using my master branch for it. I could have gotten away with that mistake, but it seems you tightened the rules for committing to master. :)

jeremicmilan avatar May 27 '24 00:05 jeremicmilan