
Sunday, August 5, 2012

What Is Robots.txt?

Robots.txt is a text (not HTML) file you put on your site to tell search engines which pages you would like them not to visit. Robots.txt is by no means mandatory for search engines, but well-behaved search engines generally obey what they are asked not to do. It is important to clarify that robots.txt is not a way of preventing search engines from crawling your site (i.e. it is not a firewall or a kind of password protection); putting up a robots.txt file is something like putting a “Please, do not enter” note on an unlocked door – you cannot prevent thieves from coming in, but the good guys will not open the door and walk in. That is why, if you have really sensitive data, it is naïve to rely on robots.txt to protect it from being indexed and displayed in search results.

It is great when search engines frequently visit your site and index your content, but sometimes you don't want certain content to be crawled. For instance, if you have sensitive data on your site that you do not want the world to see, you will also prefer that search engines do not index those pages (although in this case the only sure way to keep sensitive data out of the index is to keep it offline on a separate machine). Additionally, if you want to save some bandwidth by excluding images, style sheets and JavaScript from indexing, you also need a way to tell spiders to keep away from these items.
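For instance (the directory names here are just for illustration), a robots.txt that keeps compliant spiders away from images, style sheets and scripts might look like this:

User-agent: *
Disallow: /images/
Disallow: /css/
Disallow: /js/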

 
The location of robots.txt is very important. It must be in the main directory; otherwise user agents (search engine crawlers) will not be able to find it – they do not search the whole site for a file named robots.txt. Instead, they look only in the main directory (e.g. http://appshelter.net/robots.txt), and if they don't find it there, they simply assume that the site does not have a robots.txt file and index everything they find along the way. So if you don't put robots.txt in the right place, do not be surprised when search engines index your whole site.
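Because of this "main directory" rule, the robots.txt location can be derived mechanically from any page URL. A minimal Python sketch (the URL is only an example):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the root-level robots.txt URL a crawler would look for,
    given any page URL on the site."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; crawlers ignore the page's own path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://appshelter.net/blog/post.html"))
# http://appshelter.net/robots.txt
```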

Structure of a Robots.txt File

The structure of a robots.txt file is pretty simple (and barely flexible) – it is a list of user agents and of disallowed files and directories. Basically, the syntax is as follows:

- User-agent:
- Disallow:

“User-agent:” names the search engine crawler a rule applies to, and “Disallow:” lists the files and directories to be excluded from indexing. In addition to “User-agent:” and “Disallow:” entries, you can include comment lines – just put the # sign at the beginning of the line:

# All user agents are disallowed to see the /temp directory.
User-agent: *
Disallow: /temp/
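Python's standard library happens to ship a parser for exactly this format, so you can sanity-check a record like the one above and see what a conforming crawler would do with it:

```python
from urllib.robotparser import RobotFileParser

# The sample record from above: every crawler is asked to stay out of /temp/.
rules = """\
# All user agents are disallowed to see the /temp directory.
User-agent: *
Disallow: /temp/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("AnyBot", "/temp/page.html"))  # False - /temp/ is blocked
print(parser.can_fetch("AnyBot", "/index.html"))      # True - everything else is allowed
```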

The Traps of a Robots.txt File

When you start making complicated files – i.e. you decide to allow different user agents access to different directories – problems can start if you do not pay special attention to the traps of a robots.txt file. Common mistakes include spelling mistakes and contradicting directives. These can be tricky to find, but in some cases validation tools help.

The more serious problem is with logical errors. For instance:

User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /temp/
Disallow: /cgi-bin/
Disallow: /css/

The above example is from a robots.txt with two records – a generic one that keeps all agents out of /temp/, followed by a more specific one for Googlebot. The trap is that a crawler does not merge the two: a robot obeys only the record that matches its name most specifically, so Googlebot follows its own record and ignores the generic one entirely. In this example that works out, because the Googlebot record repeats Disallow: /temp/ – but if that line were left out on the assumption that the * record already covers it, Googlebot would happily crawl /temp/. Rules in the * record are not inherited by more specific records. You see, the structure of a robots.txt file is simple, but serious mistakes can still be made easily.
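How a given crawler resolves overlapping records is behaviour worth testing rather than guessing. Python's urllib.robotparser, for example, applies the most specific matching record, so the two records above are handled like this (the bot names are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /temp/
Disallow: /cgi-bin/
Disallow: /css/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot is matched by its own record, so /css/ is off limits to it...
print(parser.can_fetch("Googlebot", "/css/site.css"))      # False
# ...while any other bot falls back to the * record, which says nothing about /css/.
print(parser.can_fetch("SomeOtherBot", "/css/site.css"))   # True
# Both records block /temp/, so it is blocked for everyone.
print(parser.can_fetch("SomeOtherBot", "/temp/x.html"))    # False
```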

Tools to Generate and Validate a Robots.txt File

Given the simple syntax of a robots.txt file, you can always read it yourself to see if everything is OK, but it is much easier to use a validator, like this one: http://tool.motoricerca.info/robots-checker.phtml. These tools report common mistakes like missing slashes or colons, which, if undetected, compromise your efforts.
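If you'd rather script the check yourself, even a few lines of Python catch the most frequent typos. A minimal sketch (the field list and messages here are my own choices, not taken from any standard tool):

```python
KNOWN_FIELDS = ("user-agent", "disallow", "allow", "sitemap", "crawl-delay")

def lint_robots(lines):
    """Report the most common robots.txt typos: a missing colon,
    an unknown field name, or a Disallow path without a leading slash."""
    problems = []
    for n, raw in enumerate(lines, 1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {n}: missing ':' separator")
            continue
        field, _, value = line.partition(":")
        field = field.strip().lower()
        value = value.strip()
        if field not in KNOWN_FIELDS:
            problems.append(f"line {n}: unknown field '{field}'")
        elif field == "disallow" and value and not value.startswith("/"):
            problems.append(f"line {n}: path should start with '/'")
    return problems

print(lint_robots(["User-agent *", "Disallow: temp/"]))
# ["line 1: missing ':' separator", "line 2: path should start with '/'"]
print(lint_robots(["User-agent: *", "Disallow: /temp/"]))
# []
```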
 
 
