newbie-herbet: So what exactly is Robots.txt

So what exactly is Robots.txt?
Robots.txt is a simple text file that sits on your server and tells search engine bots how they should crawl your site’s URLs.
The first thing a search engine bot does before crawling your site is look for the ‘Robots.txt’ file at http://yourdomain.com/robots.txt.
If you don’t have one, bots will still crawl whatever pages they find; having it just makes things a bit easier.
Since bots check your robots.txt file before crawling anything else, it’s a good way to tell them to stay away from pages you don’t want to show up in search results.

Caution: Never use robots.txt to protect premium content on your site. Well-behaved search engines will stay away from those pages, but your robots.txt file is visible to any ordinary visitor (http://yourdomain.com/robots.txt), so anyone can read the paths listed in it and open that premium content directly. Instead, place the content behind a login so that only registered users can reach it, or add a ‘noindex’ robots meta tag to that specific page to keep it out of search results. Note that ‘noindex’ stops a page from being indexed, not from being crawled, and bots can only see the tag if the page is not blocked in robots.txt.
What does Robots.txt look like?

The average robots.txt is one of the simplest pieces of code you will ever write.
If you want a robots.txt that lets all search engine bots crawl everything they find, with no specific instructions, use the following piece of code.

User-Agent: *
Disallow:

‘User-Agent’ names the search engine bot the rules apply to. The asterisk (*) here means the instructions are for all bots.
‘Disallow’ states which section of the site search engine bots should not access. Leaving nothing after the colon means everything is accessible.
For most simple websites, these two lines are all you need.
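
For contrast, here is a sketch of the opposite case. A single forward slash after ‘Disallow’ matches every path on the site, so this version asks all bots to stay away from everything. It is shown only to illustrate how much one character changes the meaning.

```text
# Ask every bot to keep out of the whole site
User-Agent: *
Disallow: /
```
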
If your site is a bit larger and has many folders, you may want to tell search engines to avoid some pages.
A good example is a printer-friendly version of your website kept in its own folder, say “printer-ready.” There’s no point in letting search engines index two identical copies of the same content, so it’s a good idea to tell them to skip the printer-friendly version.
In that situation, the User-Agent line can stay as it is, so the instructions still go to all search engines; only the ‘Disallow’ line needs a small change.

User-Agent: *
Disallow: /printer-ready/

The forward slashes before and after the folder name are important. The path is read relative to your domain name, so the rule above refers to http://yourdomain.com/printer-ready/
If the folder is actually found at www.yourdomain.com/archives/printer-ready/, the robots.txt would have to be written in the following way.

User-Agent: *
Disallow: /archives/printer-ready/
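
A note on the trailing slash: rules are matched against the beginning of the URL path, so leaving the slash off makes the rule broader than the folder. The sketch below compares the two forms; the file names page1.html and printer-ready.html are made-up examples, not pages from this site.

```text
# With the trailing slash: blocks only what is inside the folder,
# for example /archives/printer-ready/page1.html
User-Agent: *
Disallow: /archives/printer-ready/

# Alternative without the trailing slash (commented out here):
# it would also block /archives/printer-ready.html and any other
# path that starts with /archives/printer-ready
# Disallow: /archives/printer-ready
```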

You can also change the User-Agent line to give instructions to just one or a few specific search engines. For example:

User-Agent: googlebot
Disallow: /archives/printer-ready/

In this case, the folder at http://yourdomain.com/archives/printer-ready remains accessible to all search engines except Google.
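
To address a few specific bots rather than one, the User-Agent line can be repeated before the rules. The sketch below assumes you want to keep both Google and Bing out of the folder; ‘bingbot’ is used here purely as an illustration and does not come from the example above. Bots that are not named have no rules applying to them and can crawl everything.

```text
# One group of rules shared by two named bots
User-Agent: googlebot
User-Agent: bingbot
Disallow: /archives/printer-ready/
```
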
Techfrog’s version of Robots.txt looks like this:

User-agent: *
Disallow:

Here, we allow access to all search engine bots, and all content can be indexed. We have also mentioned the location of the sitemap so that the search engine bot can find it and doesn’t miss any page while crawling.
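
The sitemap line itself is not shown in the snippet above. As a rough sketch, a sitemap is announced with a ‘Sitemap’ line that gives the full URL of the sitemap file. The address below is a placeholder built on the example domain used in this article, not Techfrog’s real sitemap location.

```text
User-agent: *
Disallow:

# Placeholder URL, replace with the real location of your sitemap
Sitemap: http://yourdomain.com/sitemap.xml
```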

How can Robots.txt be put on my site?
Once you have made the robots.txt file, just upload it to the root folder on your server and it will be available at http://yourdomain.com/robots.txt
Isn’t that simple?
