
robots.txt doesn't behave as expected

watson opened this issue 7 years ago • 5 comments

Our robots.txt file currently contains this:

User-Agent: *
Disallow: /dist/
Disallow: /docs/
Allow: /dist/latest/
Allow: /dist/latest/docs/api/
Allow: /api/

I'm not sure of the reason for disallowing /docs/, but whatever the case, I don't think it has the intended effect. Instead of removing it from Google, it just seems to remove Google's ability to show any meaningful content for the link, but Google still links to pages under /docs/.

Example: A search for "node.js util.inherits" shows this:

[Screenshot: Google search result for "node.js util.inherits" with no page description]

If you follow the "Learn why" link, you're told that:

[...] the website prevented Google from creating a page description, but didn't actually hide the page from Google. [...] You are seeing this result because the page is blocked by a robots.txt file on your website. (robots.txt tells Google not to read your page; if you block the page to us, we can't create a page description in search results.)
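For reference: a robots.txt Disallow only blocks crawling, it does not remove a page from the index. If the goal really were to keep /docs/ out of search results, the standard mechanism is a noindex directive served with the page itself, which the crawler can only see if it is allowed to fetch the page. A sketch (this is not in the current site templates, just an illustration):

```html
<!-- Hypothetical sketch, not part of the current nodejs.org pages.
     A robots meta tag asks crawlers not to index this page. Unlike a
     robots.txt Disallow, the crawler must be able to fetch the page
     to see this directive, so the URL must NOT be disallowed. -->
<meta name="robots" content="noindex">
```

The same directive can be sent as an `X-Robots-Tag: noindex` HTTP response header for non-HTML resources.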

watson avatar Dec 18 '18 09:12 watson

Hmmm … 🤔 I have no idea why we've set it that way. Maybe @rvagg remembers? I think this goes back to the early website days and hasn't been touched ever since.

fhemberger avatar Dec 18 '18 09:12 fhemberger

nope! and the use of /dist/ instead of /download/ also rattles my bones, I desperately want /dist/ to be deprecated.

I think robots.txt is up for complete revision, someone suggest a new one that makes sense and make it so. https://github.com/nodejs/nodejs.org/blob/master/static/robots.txt

We had a discussion recently about the best entrypoint to the docs and I think we had differing opinions. Some people like /docs/, some do it through /api/ (I do). The docs themselves suggest doing it through /docs/latest-*/api/ is the "official" way.

rvagg avatar Dec 18 '18 10:12 rvagg

Hi,

I believe the correct steps to ensure proper indexing are:

  1. Create a sitemap
  2. Upload the sitemap to the root directory
  3. Modify robots.txt and refer to the sitemap in it
  4. Submit robots.txt via Google Search Console or wait for Google to pass by
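As a sketch of step 3, robots.txt can point crawlers at the sitemap with a `Sitemap` directive; the existing rules would stay as-is, and the sitemap URL below is hypothetical until the file is actually generated:

```
# Existing rules unchanged; the only addition is the Sitemap line.
User-Agent: *
Disallow: /dist/
Allow: /dist/latest/

# Hypothetical location of the generated sitemap:
Sitemap: https://nodejs.org/sitemap.xml
```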

With your permission, I could make a proposal for an XML sitemap based on the output of one of the third-party tools Google suggests at https://support.google.com/webmasters/answer/183668?hl=en (although most of them seem paid or dead).
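For context, a minimal sitemap following the sitemaps.org protocol looks like this (the entries below are just examples, not the proposed content):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://nodejs.org/en/</loc>
  </url>
  <!-- one <url> entry per indexable page, up to 50,000 per file -->
</urlset>
```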

Please let me know, I'd be happy to help.

carlos-ds avatar Apr 22 '19 19:04 carlos-ds

@carlos-ds Thank you for the help and initiative. It'd be great if you could investigate it and create a PR.

alexandrtovmach avatar Jul 02 '19 10:07 alexandrtovmach

Ok, thanks @alexandrtovmach.

I suggest the following approach:

  • I'm going to use Sitemap Generator CLI (https://www.npmjs.com/package/sitemap-generator) to generate an XML sitemap for https://nodejs.org. I'm testing the tool right now.
  • I will go through the generated sitemap manually to review the result, make changes to the generator config accordingly, then re-run the generator and validate it again until I get the desired result.
  • Someone should validate this XML sitemap.
  • Once the sitemap is validated, I can get to work on making changes to robots.txt.
  • Someone should validate the changes to robots.txt.
  • robots.txt should be modified on the production server, and someone with access to the Google Search Console property for https://nodejs.org should submit the XML sitemap.
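The validation steps above could be partially automated. A minimal sketch (not part of the actual plan; the limits and origin check are assumptions based on the sitemaps.org protocol) that sanity-checks a generated sitemap:

```javascript
// Minimal sitemap sanity check: verifies every <loc> points at the
// https://nodejs.org origin and that the protocol's 50,000-URL-per-file
// limit holds. A real validator would also check the XML schema.
function checkSitemap(xml) {
  const locs = [...xml.matchAll(/<loc>([^<]+)<\/loc>/g)].map((m) => m[1]);
  const errors = [];
  if (locs.length === 0) errors.push('no <loc> entries found');
  if (locs.length > 50000) errors.push('more than 50,000 URLs in one file');
  for (const loc of locs) {
    if (!loc.startsWith('https://nodejs.org/')) {
      errors.push(`unexpected origin: ${loc}`);
    }
  }
  return errors;
}

// Example usage with a deliberately broken second entry:
const sample = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://nodejs.org/en/</loc></url>
  <url><loc>http://example.com/</loc></url>
</urlset>`;
console.log(checkSitemap(sample)); // flags the example.com entry
```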

Does that sound like a good approach for this issue? I'd appreciate your feedback.

carlos-ds avatar Jul 14 '19 15:07 carlos-ds

@richardlau I was rechecking this issue and nodejs.org/robots.txt gives a 404, even though the file is referenced here: https://github.com/nodejs/nodejs.org/blob/main/public/robots.txt. Is this an nginx bug? Opening nodejs.org/manifest.json works (https://github.com/nodejs/nodejs.org/blob/main/public/manifest.json).

ovflowd avatar Mar 12 '23 13:03 ovflowd

> @richardlau I was rechecking this issue and nodejs.org/robots.txt gives a 404, even though the file is referenced here: https://github.com/nodejs/nodejs.org/blob/main/public/robots.txt. Is this an nginx bug? Opening nodejs.org/manifest.json works (https://github.com/nodejs/nodejs.org/blob/main/public/manifest.json).

@ovflowd It's currently aliased -- I presume it has moved as part of the Next.js rewrite? https://github.com/nodejs/build/blob/faf52ba2a17983598637dc0f1c918451299a38ad/ansible/www-standalone/resources/config/nodejs.org?plain=1#L328-L331

richardlau avatar Mar 13 '23 13:03 richardlau

True, I moved it from the static folder to the root of the public folder. Can we maybe remove those aliases from there?

ovflowd avatar Mar 13 '23 14:03 ovflowd

I made an update in this PR (https://github.com/nodejs/build/pull/3139). I still believe we could try that PR out, to see if everything is ✅.

ovflowd avatar Mar 13 '23 14:03 ovflowd

We could also, for now, hot-fix the nginx config and remove those aliases. Either way, I feel confident enough that the new nginx config is working. You can create a temporary file and use nginx -t to check that the config is valid.
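To illustrate the nginx -t suggestion: a server block can't be syntax-checked on its own, so one hypothetical approach (all paths here are placeholders) is to wrap the config under review in a throwaway main config and test that:

```nginx
# test.conf — hypothetical wrapper, used only for syntax-checking.
# Run: nginx -t -c /full/path/to/test.conf
events {}
http {
    include /full/path/to/nodejs.org.conf;  # the site config under review
}
```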

ovflowd avatar Mar 13 '23 14:03 ovflowd

Closing as fixed.

ovflowd avatar Mar 15 '23 13:03 ovflowd