
Take the RobotsTxt User-Agent from the Request

adonig opened this pull request 1 year ago • 2 comments

This pull request updates the RobotsTxt middleware to dynamically use the User-Agent from each request instead of relying on a hardcoded value. It supersedes an earlier attempt, ensuring that the changes merge cleanly without the previous issues.
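For context, the idea is that the middleware looks up the User-Agent on each incoming request before asking Gollum, roughly along these lines. This is a minimal sketch, not the actual diff from the PR: the header lookup, the "Crawly Bot" fallback value, and the ignored options are assumptions.

  defmodule Crawly.Middlewares.RobotsTxt do
    # Sketch: decide crawlability per request using the request's own User-Agent.
    @behaviour Crawly.Pipeline

    @impl Crawly.Pipeline
    def run(request, state, _opts \\ []) do
      # Assumption: headers are stored as {name, value} tuples on %Crawly.Request{},
      # and "Crawly Bot" stands in for whatever default the middleware falls back to.
      user_agent =
        Enum.find_value(request.headers, "Crawly Bot", fn
          {"User-Agent", ua} -> ua
          _other -> nil
        end)

      case Gollum.crawlable?(user_agent, request.url) do
        :uncrawlable -> {false, state}
        _crawlable_or_undefined -> {request, state}
      end
    end
  end

With something like this in place, two requests in the same crawl can be evaluated against robots.txt under different user agents, which is what the test below exercises.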

adonig · Apr 11 '24 07:04

Maybe there's a way to squash all those commits into one 😅

adonig · Apr 15 '24 07:04

Do you believe this test is sufficient?

  test "Respects the User-Agent header when evaluating robots.txt" do
    :meck.expect(Gollum, :crawlable?, fn
      "My Custom Bot", _url -> :crawlable
      _ua, _url -> :uncrawlable
    end)

    middlewares = [
      {Crawly.Middlewares.UserAgent, user_agents: ["My Custom Bot"]},
      Crawly.Middlewares.RobotsTxt
    ]

    req = @valid
    state = %{spider_name: :test_spider, crawl_id: "123"}

    assert {%Crawly.Request{}, _state} =
             Crawly.Utils.pipe(middlewares, req, state)

    middlewares = [Crawly.Middlewares.RobotsTxt]

    assert {false, _state} = Crawly.Utils.pipe(middlewares, req, state)
  end

adonig · Apr 25 '24 11:04