robots.txt doesn't behave as expected
Our robots.txt file currently contains this:
User-Agent: *
Disallow: /dist/
Disallow: /docs/
Allow: /dist/latest/
Allow: /dist/latest/docs/api/
Allow: /api/
I'm not sure of the reason for disallowing /docs/, but whatever the case, I don't think it has the intended effect. Instead of removing it from Google, it just seems to remove Google's ability to show any meaningful content for the link - but the results still link to pages under /docs/.
Example: A search for "node.js util.inherits" shows this:
[screenshot: the search result links to the /docs/ page but has no page description]
If you follow the "Learn why" link, you're told that:
[...] the website prevented Google from creating a page description, but didn't actually hide the page from Google. [...] You are seeing this result because the page is blocked by a robots.txt file on your website. (robots.txt tells Google not to read your page; if you block the page to us, we can't create a page description in search results.)
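Worth noting for context: a robots.txt Disallow only blocks crawling, not indexing, so blocked pages can still appear as bare links with no description. If the intent is to keep /docs/ pages out of search results entirely, the usual approach is to let Google crawl them and serve a noindex signal instead. A minimal illustration (not a concrete proposal for our pages):

```html
<!-- in the <head> of any page that should stay out of search results;
     the X-Robots-Tag: noindex HTTP response header is an equivalent option -->
<meta name="robots" content="noindex">
```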
Hmmm … 🤔 I have no idea why we've set it that way. Maybe @rvagg remembers? I think this goes back to the early website days and hasn't been touched ever since.
nope! and the use of /dist/ instead of /download/ also rattles my bones, I desperately want /dist/ to be deprecated.
I think robots.txt is up for complete revision, someone suggest a new one that makes sense and make it so. https://github.com/nodejs/nodejs.org/blob/master/static/robots.txt
We had a discussion recently about the best entrypoint to the docs, and I think we had differing opinions. Some people like /docs/, some go through /api/ (I do). The docs themselves suggest that going through /docs/latest-*/api/ is the "official" way.
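To get the ball rolling, here's a rough sketch of a simplified robots.txt, assuming the main goal is to keep the versioned /dist/ tree uncrawled while leaving the docs fully crawlable; treat the exact paths as a starting point for discussion rather than a final proposal:

```
# keep release artifacts under /dist/ out of the crawl, except the latest release
User-Agent: *
Disallow: /dist/
Allow: /dist/latest/
```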
Hi,
I believe the correct steps to ensure proper indexing are:
- Create a sitemap
- Upload the sitemap to the root directory
- Modify robots.txt and refer to the sitemap in it
- Submit robots.txt via Google Search Console or wait for Google to pass by
With your permission, I could make a proposal for an XML sitemap based on the output of one of the third-party tools Google suggests at https://support.google.com/webmasters/answer/183668?hl=en (although most of them seem paid or dead).
Please let me know, I'd be happy to help.
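For reference, a minimal sitemap following the sitemaps.org protocol looks roughly like the snippet below (the URLs and lastmod date are placeholders); robots.txt would then only need a `Sitemap: https://nodejs.org/sitemap.xml` line pointing at it:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one <url> entry per canonical page; lastmod is optional -->
  <url>
    <loc>https://nodejs.org/en/</loc>
    <lastmod>2020-01-01</lastmod>
  </url>
  <url>
    <loc>https://nodejs.org/en/docs/</loc>
  </url>
</urlset>
```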
@carlos-ds Thank you for the help and initiative. It'd be great if you could investigate it and create a PR.
Ok, thanks @alexandrtovmach.
I suggest the following approach:
- I'm gonna use Sitemap Generator CLI (https://www.npmjs.com/package/sitemap-generator) to generate an XML sitemap for https://nodejs.org. I'm testing the tool right now.
- I will go through the generated sitemap manually to review the result and adjust the generator config accordingly, then re-run the generator and validate it again until I get the desired result
- Someone should validate this XML sitemap
- Once the sitemap is validated, I can get to work on making changes to robots.txt
- Someone should validate the changes to robots.txt
- robots.txt should be modified on the production server, and someone with access to Google Search Console for https://nodejs.org should submit the XML sitemap
Does that sound like a good approach for this issue? I'd appreciate your feedback.
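For what it's worth, using the sitemap-generator package programmatically would look roughly like the sketch below; I'm writing the option names from memory of the package README, so they need double-checking against the current docs before relying on them:

```js
// rough sketch -- verify the option names against the sitemap-generator docs
const SitemapGenerator = require('sitemap-generator');

const generator = SitemapGenerator('https://nodejs.org', {
  filepath: './sitemap.xml', // assumed option: where the sitemap file is written
  stripQuerystring: true     // assumed option: drop query strings from crawled URLs
});

// 'done' fires once the crawl finishes and the sitemap has been written
generator.on('done', () => {
  console.log('sitemap.xml generated');
});

generator.start();
```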
@richardlau I was rechecking this issue and nodejs.org/robots.txt gives a 404, even though the file is referenced here (https://github.com/nodejs/nodejs.org/blob/main/public/robots.txt). Is this an nginx bug? Opening nodejs.org/manifest.json works (https://github.com/nodejs/nodejs.org/blob/main/public/manifest.json).
@ovflowd It's currently aliased -- I presume it has moved as part of the Next.js rewrite? https://github.com/nodejs/build/blob/faf52ba2a17983598637dc0f1c918451299a38ad/ansible/www-standalone/resources/config/nodejs.org?plain=1#L328-L331
true, I moved it from the static folder to the root of the public folder. Can we maybe remove those aliases from there?
I made an update in this PR (https://github.com/nodejs/build/pull/3139). I still believe we could try that PR out, to see if everything is ✅.
We could also, for now, hot-fix the nginx config and remove those aliases. Either way, I feel confident enough that the new nginx config is working. You can create a temporary file and use nginx -t to check whether the config is valid.
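For the record, that check is just something along these lines (the temporary path is only an example):

```sh
# validate the currently installed config
nginx -t

# or point nginx at a temporary copy of the full proposed config
nginx -t -c /tmp/nodejs.org.conf
```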
Closing as fixed.