What comes to mind if I say “robots”? Hmm, I guess something from a Terminator movie or some new AI series on Netflix. Well, unless you work in search engine optimization. For search engine optimization experts, robots are one of the most low-level concepts.
Practically, a robot refers to a text file with a few instructions for web crawlers so that they know what to crawl and, more importantly, what to leave out. In their primitiveness, robots are very effective and should be part of every basic search engine optimization review. Uses vary from restricting crawlers to specific parts of your website, whether an administration screen or just data that will not look good if crawled.
What does a Robots.txt file contain?
Basic parts of the robots.txt file consist of:
- User-agent — Specifies which crawler the instructions that follow are for. User agents can be specific, like Googlebot or Bingbot, or a wildcard * to instruct them all.
- Disallow — This command instructs robots not to crawl this area. Example: Disallow: /admin/ means do not crawl anything under the admin folder.
- Allow — On the other hand, this specifically tells robots they may crawl this area. Normally used for exceptions to a Disallow rule.
- Crawl-delay — This is like a pause; it tells robots to wait a certain number of seconds before continuing the crawl.
- Sitemap — Some sites might have multiple sitemaps, so you can use this to specify each sitemap's location.
- Noindex — This command has been used to instruct the Google crawler to remove pages from the index.
This robots.txt file also supports special characters like:
- # — comments out a line so it will not be read.
- * — matches any sequence of characters.
- $ — the URL must end here.
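Putting the pieces above together, a small robots.txt might look like the sketch below. The paths and sitemap URL are made up for illustration only:

```
# Keep all crawlers out of the admin area,
# except the public help pages
User-agent: *
Allow: /admin/help/
Disallow: /admin/
Disallow: /*.pdf$

# Ask Bing to slow down a little
User-agent: Bingbot
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml
```

Note how * and $ combine in /*.pdf$ to block any URL ending in .pdf, and how the Allow line carves an exception out of the broader Disallow.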
What do you need to know about your robots file?
First of all, this is a UTF-8 text file and should always be named robots.txt, with no other extension. Primitive as it is, the file needs to sit in the website root folder, and subdomains need their own robots.txt. Note that crawlers can ignore these files, but normally they do not. Also, unlike domain names, the paths in these rules are case sensitive, so take care.
One should also note the difference between Disallow and Noindex. Disallow simply suggests that crawlers not cover a particular path. This is not a deindex, as the path may still be reached in other ways, for example if someone links to the page externally. Crawl-delay is ignored most of the time by Google's crawlers, yet the crawl rate can be managed through the crawl settings in Google Search Console. And though the Noindex command in robots.txt has been honored by Google in the past, it is unofficial and not the best approach to block indexing. Ideally this is done on the actual page, using a meta robots tag or an X-Robots-Tag header.
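If you want to check how your rules will be read before a crawler does, Python's standard library ships a robots.txt parser. A small sketch, using made-up example.com paths — note that Python's parser applies the first matching rule, so the Allow exception goes before the broader Disallow (Google instead picks the most specific rule regardless of order):

```python
from urllib import robotparser

# Rules as they would appear in robots.txt; Allow listed first
# because urllib.robotparser honors the first matching line.
rules = """
User-agent: *
Allow: /admin/public/
Disallow: /admin/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Blocked: falls under the /admin/ Disallow rule
print(rp.can_fetch("*", "https://example.com/admin/settings"))

# Allowed: matches the /admin/public/ exception
print(rp.can_fetch("*", "https://example.com/admin/public/faq"))

# Allowed: no rule matches, so crawling defaults to permitted
print(rp.can_fetch("*", "https://example.com/blog/post"))
```

For a live site you would call rp.set_url("https://example.com/robots.txt") followed by rp.read() instead of feeding the lines in by hand.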
What can be fun about this robots text file?
If you are one of those old timers and recall the days of batch files and chat scripts, you know that ASCII can be fun. Yes, because by abusing the comment # command, with enough time to waste, you can come up with some really cool human instructions in your robots file.
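For example, something like this entirely made-up file: crawlers skip every # line, while curious humans who peek at the file get a little easter egg:

```
#      _____
#     |     |
#     | o o |   Hello human! Robots only past this point.
#     |  ^  |   Like what you see? We are hiring.
#     |_____|

User-agent: *
Disallow: /admin/
```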