incubator-stormcrawler

Add basic authentication to HTTP protocol

Open jnioche opened this issue 8 years ago • 8 comments

We could store the credentials in a file in resources and/or from the tuple metadata. The class to modify is HttpProtocol

jnioche avatar Feb 23 '17 07:02 jnioche

I am working on this issue and trying to find a suitable approach. Can we use cookies to authenticate against intranet websites that require authentication? We might also need a cookie management system.

isspek avatar Feb 23 '17 11:02 isspek

Great! There is already an open issue for the cookies, #32, but we haven't done much work on it yet. Unlike the credentials, which could be stored in a file in /resources, the cookies would be persisted like any other metadata and converted back on the fly into the objects required by the HTTP client library. That library has its own implementation of cookie storage, but I think it would be best not to use it, as we have no guarantee that the same JVM instance will still be alive when the next URL comes round, e.g. the worker could have died. Dealing with the cookies purely as text in metadata is also more flexible, as they could be passed as seed metadata.
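A minimal sketch of the idea above, keeping cookies as plain text in the metadata and rebuilding cookie objects only when a request is made. It uses the JDK's java.net.HttpCookie rather than the HTTP client library's own cookie classes, and a plain Map as a stand-in for the crawler's metadata; the class name, the method names and the "set-cookie" key are assumptions made for illustration, not part of StormCrawler.

```java
import java.net.HttpCookie;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CookieMetadataSketch {

    // Hypothetical metadata key, chosen for this example only.
    static final String COOKIE_KEY = "set-cookie";

    // Keep the raw Set-Cookie header values returned by the server as text.
    static void storeCookies(Map<String, List<String>> metadata, List<String> setCookieHeaders) {
        metadata.put(COOKIE_KEY, new ArrayList<>(setCookieHeaders));
    }

    // Rebuild cookie objects on the fly from the text held in the metadata.
    static List<HttpCookie> restoreCookies(Map<String, List<String>> metadata) {
        List<HttpCookie> cookies = new ArrayList<>();
        List<String> rawValues = metadata.get(COOKIE_KEY);
        if (rawValues == null) {
            return cookies;
        }
        for (String raw : rawValues) {
            cookies.addAll(HttpCookie.parse(raw));
        }
        return cookies;
    }

    // Build the value of the Cookie request header for the next fetch.
    static String toCookieHeader(List<HttpCookie> cookies) {
        StringBuilder sb = new StringBuilder();
        for (HttpCookie cookie : cookies) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(cookie.getName()).append('=').append(cookie.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the metadata attached to a URL; it survives a worker restart
        // because it is just text persisted with the URL.
        Map<String, List<String>> metadata = new HashMap<>();
        storeCookies(metadata, Arrays.asList(
                "JSESSIONID=abc123; Path=/; HttpOnly",
                "lang=en; Path=/"));
        System.out.println("Cookie: " + toCookieHeader(restoreCookies(metadata)));
    }
}
```

Because nothing here depends on a live cookie store, the same text could equally be supplied by hand in the seed metadata.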

jnioche avatar Feb 23 '17 11:02 jnioche

@isspek any updates on this? Anything I can help you with?

jnioche avatar Mar 13 '17 16:03 jnioche

@jnioche I couldn't find a proper solution for websites that require form-based authentication, so I really need your advice.

I have generated cookies in configure of HttpProtocol:

cookieStore = new BasicCookieStore();
builder.setDefaultCookieStore(cookieStore);
getAuthorization(Parameters.get("domain.user"), Parameters.get("domain.password"));

The getAuthorization function generates the cookies using form-based authentication. So far, I haven't found a way to get the parameters of the forms automatically:

private void getAuthorization(String username, String password) {
    AbstractFormBasedAuthentication hba = FormBasedAuthenticationFactory.getAuthenticator();
    hba.authenticate(username, password);
    List<Cookie> cookies = hba.getCookies();
    for (Cookie cookie : cookies) {
        cookieStore.addCookie(cookie);
    }
}

The problem is that this is not a general approach, and each URL must be checked to see whether it requires authorization before it is processed. In my opinion, the basic credentials and the type of authentication should be given in the spout or bolt, so that we can initialize the cookies and store them in the metadata.

isspek avatar Mar 14 '17 05:03 isspek

This issue is about basic authentication as described in [https://en.wikipedia.org/wiki/Basic_access_authentication]. There is ongoing work on cookies #32, maybe have a look at the exchanges there.
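For reference, basic access authentication amounts to sending an Authorization header whose value is "Basic " followed by the Base64 encoding of "user:password". The sketch below builds that header with the JDK only; the class and method names are made up for illustration, and the credentials are the well-known example from RFC 7617, not anything from this project.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class BasicAuthSketch {

    // Build the value of the Authorization header for basic access authentication.
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // prints "Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
        System.out.println("Authorization: " + basicAuthHeader("Aladdin", "open sesame"));
    }
}
```

The encoding is not encryption, so credentials sent this way are only protected when the request goes over HTTPS.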

the basic credentials and the type of authentication should be given in the spout or bolt, so that we can initialize the cookies and store them in the metadata.

If you know the cookies in advance, you will be able to pass them in the seed metadata. The outlinks will then get whichever cookies are returned by the server, and so on.

jnioche avatar Mar 14 '17 12:03 jnioche

@jnioche By the way, the form on the website I crawl also contains parameters such as actionflag=... in addition to the username and password, so the solution mentioned in #432 didn't work. I therefore wrote a class (FormBasedAuthenticationFactory) similar to this one (http://www.mkyong.com/java/apache-httpclient-examples/). It works but, again, it is hard-coded rather than generic, since I couldn't get the parameters of the form programmatically. I will implement an approach for passing the credentials with the seed.

isspek avatar Mar 14 '17 15:03 isspek

Note to self: http://www.httpwatch.com/httpgallery/authentication/ contains some live examples we could use to test the authentication

jnioche avatar Mar 14 '17 16:03 jnioche

@isspek Assuming we have the cookie mechanism in place, one way of doing it would be to add support for POST requests. This would be triggered by some arbitrary key/value in the metadata, usually for the seeds. You'd then get the cookies you need for the outlinks. This assumes, however, that the page you'll be sending the form to contains the outlinks needed for the crawl.

http://www.baeldung.com/httpclient-post-http-request contains some code illustrating how to do that.
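As a rough sketch of the metadata-triggered POST described above: the protocol could check for a marker key and, if present, turn the remaining form fields into an application/x-www-form-urlencoded body. The key names "http.method" and "http.post.param." are invented for this example, the metadata is modelled as a plain Map, and the code only builds the request body (JDK only, Java 10+ for the URLEncoder overload); actually sending it is left to whichever HTTP client is in use.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class FormPostSketch {

    // Hypothetical metadata keys, chosen for this example only.
    static final String METHOD_KEY = "http.method";
    static final String PARAM_PREFIX = "http.post.param.";

    // The arbitrary key/value that switches the fetch from GET to POST.
    static boolean isPost(Map<String, String> metadata) {
        return "POST".equalsIgnoreCase(metadata.get(METHOD_KEY));
    }

    // Collect every prefixed metadata entry into a URL-encoded form body.
    static String formBody(Map<String, String> metadata) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            if (!entry.getKey().startsWith(PARAM_PREFIX)) {
                continue;
            }
            if (body.length() > 0) {
                body.append('&');
            }
            String name = entry.getKey().substring(PARAM_PREFIX.length());
            body.append(URLEncoder.encode(name, StandardCharsets.UTF_8))
                .append('=')
                .append(URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Seed metadata; LinkedHashMap keeps the fields in insertion order.
        Map<String, String> seed = new LinkedHashMap<>();
        seed.put(METHOD_KEY, "POST");
        seed.put(PARAM_PREFIX + "username", "jdoe");
        seed.put(PARAM_PREFIX + "password", "p@ss word");
        if (isPost(seed)) {
            // prints "username=jdoe&password=p%40ss+word"
            System.out.println(formBody(seed));
        }
    }
}
```

Since the form fields are just metadata entries, any extra parameter a particular login form needs could be supplied the same way without hard-coding it.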

jnioche avatar Mar 14 '17 16:03 jnioche

Both OkHttp and Apache HttpClient support basic authentication.

jnioche avatar Dec 05 '23 16:12 jnioche