Improve user agents database
A lot of my crawl depends on proper user-agent strings. It's a bit hard to supply user agents using a config as we're doing now. It would be good to have a database of user agents and to pick user agents from it. I am thinking of a standalone application with a simple interface, which then could be integrated into crawly.
We could get a database from http://www.useragentstring.com/pages/api.php or any other service.
Hey @oltarasenko, just stumbled upon this issue today. I'd suggest Faker's Internet.UserAgent module! Let me know if you need help, I'll try to chip in a PR if possible!
@sreecodeslayer Looks very promising. Please try to sketch a PR. I will be able to help if needed.
Sure, I'll try this over the weekend!
I'm thinking of having two ways to configure this.
- ~~The users should have an option to set default list of user-agents that they wish to use, under configuration. The items from the list can then be chosen randomly when setting up request options.~~ Oops! Looks like I spoke too soon. I think this is already possible
- By default, Crawly should fall back to an item from Faker.Internet.UserAgent. One thing to be aware of here is the use of random user agents, i.e., using a mobile user agent on some websites can mess up the xpaths [when the website supports both kinds of devices using separate css classes and what not!]. So it might be better to have this [type-of-user-agent-device] as an upfront config option if the user is not setting a default list of user-agents
Your thoughts? 😄
From practice: Sometimes it's a pain to find an appropriate set of user agents for a given website :( Just as you have stated, websites would render something completely different in some cases. Also, in some cases we were pretending to be old Android devices, which allowed us to query the API directly.
So what I want to have:
- Possibility to override user agent strings with just a string (as it's done now). So in these cases it's possible to choose something more concrete
- Generate user agent string based on a category: mobile/desktop/etc from the Faker API
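A rough sketch of how these two options could fit together, just to make the idea concrete. `UserAgentPicker`, `pick/1` and the `:user_agents`, `:category` and `:generators` option names are all made up for illustration and are not part of Crawly. The generators are hard-coded stand-ins so the snippet runs without dependencies; with Faker installed they could presumably be swapped for functions like `Faker.Internet.UserAgent.desktop_user_agent/0`.

```elixir
defmodule UserAgentPicker do
  @moduledoc "Hypothetical helper for illustration; not part of Crawly."

  # Returns a user agent string.
  #   * user_agents: a single string or a list of strings -> explicit override
  #   * category: :desktop | :mobile -> generated when no override is given
  def pick(opts \\ []) do
    case Keyword.get(opts, :user_agents, []) do
      # A plain string overrides everything, as it does today.
      user_agent when is_binary(user_agent) ->
        user_agent

      # No override: generate a user agent for the requested category.
      [] ->
        category = Keyword.get(opts, :category, :desktop)
        generators = Keyword.get(opts, :generators, default_generators())
        Map.fetch!(generators, category).()

      # A list of overrides: pick one at random.
      user_agents when is_list(user_agents) ->
        Enum.random(user_agents)
    end
  end

  # Stand-in generators (assumption). Example strings only, not real
  # browser user agents.
  defp default_generators do
    %{
      desktop: fn -> "Mozilla/5.0 (X11; Linux x86_64) ExampleDesktop/1.0" end,
      mobile: fn -> "Mozilla/5.0 (Linux; Android 4.4) ExampleMobile/1.0" end
    }
  end
end
```

Usage would look like `UserAgentPicker.pick(user_agents: "MyBot/1.0")` for a concrete override, or `UserAgentPicker.pick(category: :mobile)` to generate one per category. An unknown category raises via `Map.fetch!/2`, which seems preferable to silently sending a desktop user agent to a mobile-only endpoint.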
Suggestions: Currently, user agents are handled in https://github.com/oltarasenko/crawly/blob/master/lib/crawly/middlewares/user_agent.ex.
Maybe we can make a twist here and just call Faker.Internet.UserAgent.desktop_user_agent() (or similar) if user agents are not provided as a list... It sounds a bit untidy, but let's start playing with it; maybe we will find a clean way of directing the code to Faker
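Something along these lines, maybe. This is only a sketch of the fallback idea and not the actual middleware code: `RandomUserAgentSketch` and the `:user_agents` / `:fallback` options are hypothetical, and the request is assumed to be a map with a `:headers` list of `{name, value}` tuples. The default fallback is a hard-coded string so the snippet runs on its own; the idea would be to pass a Faker function as `:fallback` instead.

```elixir
defmodule RandomUserAgentSketch do
  @moduledoc "Hypothetical sketch; does not depend on Crawly or Faker."

  # Takes a request and the current state, returns both, with the
  # "User-Agent" header replaced.
  def run(request, state, opts \\ []) do
    user_agents = Keyword.get(opts, :user_agents, [])
    # Assumption: with Faker available, the caller could pass
    # fallback: &Faker.Internet.UserAgent.desktop_user_agent/0
    fallback = Keyword.get(opts, :fallback, &default_user_agent/0)

    user_agent =
      case user_agents do
        # No list configured: generate a user agent instead.
        [] -> fallback.()
        # List configured: keep the random pick from that list.
        list -> Enum.random(list)
      end

    # Drop any existing header (exact-case match only) and prepend the new one.
    headers = List.keydelete(request.headers, "User-Agent", 0)
    {%{request | headers: [{"User-Agent", user_agent} | headers]}, state}
  end

  # Stand-in value (assumption), used only when no :fallback is given.
  defp default_user_agent, do: "Mozilla/5.0 (X11; Linux x86_64) ExampleDesktop/1.0"
end
```

Keeping the generator as a plain function argument means the middleware itself would not need to know about Faker at all, which might be one way to keep it tidy.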