robots.txt Guide India 2026 — Complete Configuration Tutorial
Every website owner in India needs to understand how robots.txt works. This small text file controls which pages search engines crawl, protects sensitive admin areas from indexing, manages your crawl budget, and keeps your server from being overwhelmed by bots. A misconfigured robots.txt can hide your entire site from Google or allow competitor scrapers to steal your content. Getting it right means better SEO, faster indexing of important pages, and protection of resources. This guide covers everything you need to configure robots.txt correctly for Indian websites in 2026, from basic syntax to advanced configurations for WordPress, e-commerce, and high traffic sites.
What is robots.txt
robots.txt is a plain text file that lives in the root directory of your website (e.g., yourdomain.com/robots.txt). It contains instructions for web crawlers and bots about which pages they are allowed to crawl and index. The file follows the Robots Exclusion Protocol, a standard that most legitimate search engine crawlers respect.
Despite its name, robots.txt is not a security measure. It is a suggestion protocol. Malicious bots completely ignore it. However, legitimate search engines like Googlebot, Bingbot, and Yandex respect robots.txt directives. For Indian website owners, understanding robots.txt is essential for proper SEO management, especially when you need to prevent search engines from indexing admin panels, duplicate content, or staging environments.
When a search engine crawler visits your site, it requests robots.txt first. If your file exists and is properly formatted, crawlers read the directives and adjust their crawling behavior accordingly. If the file is missing or malformed, crawlers typically default to crawling everything they can find, which may not be what you want.
How robots.txt Works
Understanding how crawlers process robots.txt helps you write better directives. When a bot visits your site, it reads robots.txt before crawling any page. The bot matches each directive against the URL it plans to crawl. If a directive allows access, the bot proceeds. If it disallows access, the bot skips that URL.
Key Concepts
- Crawl Budget: Search engines allocate a crawl budget to each site — the number of pages they will crawl during a given period. By blocking low value pages, you direct crawlers toward important content. This is especially important for large Indian e-commerce sites with thousands of product pages.
- User Agents: Each crawler identifies itself with a user agent string, such as "Googlebot" for Google and "Bingbot" for Bing. You can write directives for specific bots or use an asterisk (*) to target all bots.
- Directive Precedence: When a URL matches both Allow and Disallow directives for the same user agent, the more specific rule wins. When in doubt, bots use the longest matching path (see the example just after this list).
- Crawl Delay: The Crawl-delay directive tells bots how many seconds to wait between requests. This does not affect Googlebot much but can help if your server struggles with bot traffic from less sophisticated crawlers.
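For example, with the following rules (the paths are hypothetical), a request for /offers/diwali/sale.html is allowed: the Allow path /offers/diwali/ is longer, and therefore more specific, than the Disallow path /offers/.
User-agent: *
Disallow: /offers/
Allow: /offers/diwali/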
Important distinction: Disallowing a page in robots.txt does not remove it from search engine indexes if it is already indexed. It only stops crawlers from fetching the page, so they cannot see updated content, follow its links, or read a noindex tag on it. To actually remove a page from search results, use the noindex meta tag (on a page that remains crawlable) or the Google Search Console removal tool.
Basic robots.txt Syntax
robots.txt syntax is straightforward but must follow specific rules. Each line contains a directive and its value, separated by a colon. Comments start with #. Here are the essential directives.
User-agent
Specifies which crawler the following rules apply to. Common user agents include Googlebot, Googlebot-Image, Bingbot, Slurp (Yahoo), and DuckDuckBot. Use * to apply rules to all crawlers.
User-agent: Googlebot
Disallow
Tells crawlers not to crawl specific URLs or URL patterns. An empty Disallow line means everything is allowed. Wildcards using * are supported for pattern matching.
Disallow: /admin/
Disallow: /wp-admin/
Allow
Explicitly permits crawling of specific URLs, even if a parent directory is disallowed. Useful for allowing access to a subdirectory within a blocked parent directory.
Disallow: /private/
Allow: /private/public/
Crawl-delay
Specifies the number of seconds a crawler should wait between requests. Use this if your server struggles with bot traffic. Note that Google ignores crawl-delay for Googlebot but many other bots respect it.
Crawl-delay: 10
Sitemap
Points crawlers to your XML sitemap location. You can specify multiple sitemaps. This is helpful for large sites with multiple sitemaps and helps search engines discover all your content more efficiently.
Sitemap: https://yourdomain.com/sitemap.xml
Pattern matching supports two special characters. Asterisk (*) matches any sequence of characters. Dollar sign ($) marks the end of a URL. For example, Disallow: /*.pdf$ blocks all URLs ending with .pdf but not URLs that just contain .pdf elsewhere in the path.
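If you want to sanity-check how a wildcard pattern will behave before publishing, the short Python sketch below approximates this matching logic. It is a simplified illustration, not Google's actual matcher, and the example URLs are placeholders.
import re

def robots_pattern_matches(pattern: str, path: str) -> bool:
    # Approximate Google-style matching: '*' matches any characters,
    # a trailing '$' anchors the pattern to the end of the URL path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = re.escape(pattern).replace(r"\*", ".*")  # escape literals, then restore wildcards
    regex = "^" + regex + ("$" if anchored else "")
    return re.match(regex, path) is not None

print(robots_pattern_matches("/*.pdf$", "/brochures/pricing.pdf"))      # True: ends with .pdf
print(robots_pattern_matches("/*.pdf$", "/brochures/pricing.pdf?v=2"))  # False: .pdf is not at the end
print(robots_pattern_matches("/*.pdf", "/brochures/pricing.pdf?v=2"))   # True: .pdf appears anywhere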
Common robots.txt Configurations
Most Indian websites need similar robots.txt configurations. Here are the most common scenarios and how to handle them.
WordPress robots.txt
WordPress sites have specific areas that should be blocked from crawlers. The wp-admin directory contains your admin panel. wp-login.php is your login page. wp-content/plugins contains plugin code that is not useful to search engines. The following configuration is a good starting point for WordPress sites.
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /wp-content/plugins/
Disallow: /wp-includes/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://yourdomain.com/sitemap.xml
Block Admin and Login Pages
Admin panels, login pages, and backend areas should almost always be blocked. These pages have no SEO value, may contain sensitive information in URLs (like usernames in login referral URLs), and consume crawl budget that could be spent on public content.
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /account/
Disallow: /checkout/
Disallow: /cart/
Disallow: /?add-to-cart=*
Disallow: /wp-login.php
Disallow: /administrator/
Block Duplicate Content
E-commerce sites often generate duplicate content through URL parameters (session IDs, tracking codes, sort orders). Indian e-commerce sites using UTM parameters for marketing campaigns should block these from indexing. Also block printer-friendly versions, mobile versions (if separate URLs), and paginated pages beyond the first.
User-agent: *
Disallow: /*?utm_source=*
Disallow: /*?utm_medium=*
Disallow: /*?utm_campaign=*
Disallow: /*?session_id=*
Disallow: /*?filter=*
Disallow: /*?sort=*
Disallow: /page/2/
Disallow: /page/3/
Disallow: /page/4/
Allow Googlebot Only
If you want only Googlebot to crawl your site and block all other bots, you can write separate rules. This is rarely necessary for most Indian websites but may be useful for high traffic sites where non-essential bots consume significant bandwidth.
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
robots.txt for Indian Websites
Indian websites have some unique considerations when configuring robots.txt. Understanding local search engine behavior and regional hosting factors helps you optimize your configuration.
Google Search vs Bing in India
Google holds approximately 95% of India's search market share as of 2026. This means optimizing for Googlebot should be your primary focus. However, Bing is used by approximately 3-4% of Indian users and powers Yahoo search. If your target audience includes government or enterprise users in India, Bing may be worth considering. Write your primary directives for Googlebot and ensure they do not conflict with Bingbot requirements.
Regional Language Content
If your Indian website serves content in Hindi, Tamil, Telugu, or other Indian languages, ensure your sitemap includes hreflang annotations and that language versions are accessible to crawlers. Do not block language-specific URL patterns in robots.txt unless you intentionally do not want those versions indexed.
Shared Hosting Considerations
Many Indian businesses use shared hosting (Hostinger, BigRock, GoDaddy India, Bluehost India). Shared hosting often means sharing an IP address with hundreds of other websites. If one site on your shared IP gets flagged for scraping or bad behavior, crawlers may throttle requests to the entire IP. Proper robots.txt configuration helps ensure your site gets its fair share of crawl budget.
E-commerce Product Pages
Indian e-commerce sites on platforms like Shopify, WooCommerce, or Magento should carefully manage product page indexing. Block search result pages, filter pages, and duplicate product variations. Ensure your XML sitemap includes only canonical product URLs to avoid duplicate content issues that are common on e-commerce platforms with multiple URL variations for the same product.
Blocking Scrapers and Price Comparison Bots
Many Indian e-commerce sites face content scraping from price comparison services and competitor research tools. While sophisticated scrapers ignore robots.txt, less sophisticated ones respect it. You can use crawl-delay to slow down known scraper user agents. Common scraper user agents to consider rate-limiting include Python-urllib, libwww-perl, and various PHP-based scrapers.
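As a sketch, rules like the following slow down or shut out the simpler scrapers. The user agent strings are examples taken from the names above, and determined scrapers can simply fake a browser user agent:
User-agent: Python-urllib
Crawl-delay: 30

User-agent: libwww-perl
Disallow: /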
Testing Your robots.txt
Never publish a robots.txt file without testing it first. A mistake can hide your entire website from search engines. Testing is straightforward and can be done using free tools.
Google Search Console robots.txt Tester
Google Search Console provides a robots.txt report that shows Google's interpretation of your rules. In current versions of Search Console it sits under Settings rather than the Coverage section. The report lists the versions of your file Google has fetched and flags any lines it could not parse. To check how a proposed rule affects a specific URL before you publish it, use the URL Inspection tool or a standalone robots.txt tester.
These tools also surface syntax errors in your file. Common errors include using Disallow without a value (which means nothing is disallowed, not everything), using non-ASCII characters in paths, and placing rules before any User-agent line, which leaves them attached to no crawler. Fix these issues before publishing.
URL Inspection
After testing your robots.txt, use the URL Inspection tool in Google Search Console to fetch specific URLs on your site. This shows you exactly how Googlebot sees each page and confirms that robots.txt is not blocking important content. Check your homepage, major category pages, and any pages you have recently added to or removed from robots.txt.
Manual Testing
You can manually verify your robots.txt by visiting yourdomain.com/robots.txt in a browser. This shows the current live version of your file. Use curl from the command line to check the HTTP status code and headers: curl -I https://yourdomain.com/robots.txt. Ensure the file returns HTTP 200 and is accessible to crawlers. A 404 response means your file does not exist, and crawlers will crawl everything.
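If you prefer to script the check, Python's standard library module urllib.robotparser fetches a live robots.txt and reports whether a given user agent may crawl a given URL. Note that it implements the original exclusion protocol and does not fully support Google-style * and $ wildcards; the domain below is a placeholder.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://yourdomain.com/robots.txt")
rp.read()  # fetch and parse the live file

# Check whether specific crawlers may fetch specific URLs
print(rp.can_fetch("Googlebot", "https://yourdomain.com/wp-admin/"))  # expect False if blocked
print(rp.can_fetch("*", "https://yourdomain.com/blog/"))              # expect True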
Monitoring Crawl Stats
After publishing changes to robots.txt, monitor your crawl stats in Google Search Console. Look for changes in the number of pages crawled per day, the number of crawl errors, and which pages are being crawled. If you blocked important pages by mistake, crawl activity on them will stop and, within days to weeks, they will appear as "Blocked by robots.txt" in the indexing reports.
Common Mistakes to Avoid
robots.txt mistakes can range from minor SEO issues to complete site invisibility in search results. Here are the most common errors Indian website owners make and how to avoid them.
Blocking All Crawlers
The most catastrophic mistake is using Disallow: / with User-agent: *. This tells all crawlers to stay away from your entire site. Your site will disappear from search results entirely. Always double-check that your Disallow directives are specific and intentional before publishing.
Incorrect Wildcard Usage
Using wildcards incorrectly can create patterns you did not intend. Disallow: /*.php blocks every URL containing .php, including your homepage if it has a .php extension. Use $ to anchor patterns when you mean end-of-URL: Disallow: /*.pdf$ only blocks URLs ending with .pdf.
Forgetting Case Sensitivity
Path values in robots.txt are case-sensitive: /Admin/ is different from /admin/. Most Indian websites use lowercase paths, but if your site uses mixed case URLs, ensure your robots.txt matches the actual URL casing.
Not Including Sitemap Location
Many website owners forget to add the Sitemap directive. Without it, search engines only learn about your sitemap if you submit it manually in Google Search Console or Bing Webmaster Tools. Explicitly pointing crawlers to your sitemap accelerates discovery and ensures all your important pages are known to search engines.
Using robots.txt for Security
robots.txt is not a security measure. Sensitive URLs blocked in robots.txt are still accessible to anyone who knows the direct URL. Attackers routinely check robots.txt to find admin panels, backup files, and other interesting paths. Use proper authentication, firewall rules, and access controls for actual security.
Blocking CSS and JavaScript
Modern websites use JavaScript to render content and CSS for styling. If your robots.txt blocks /static/js/ or /static/css/, Googlebot may not be able to properly render and index your pages. Ensure your robots.txt allows access to your JavaScript and CSS directories.
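If you have previously blocked asset directories, explicit Allow rules along these lines restore rendering access. The directory names are examples and should match your actual theme or build output:
User-agent: *
# Override broader Disallow rules so Googlebot can fetch page assets
Allow: /static/css/
Allow: /static/js/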
Not Updating robots.txt When Moving Content
When you migrate pages or restructure your site, update robots.txt to reflect the new URL structure. Stale rules can accidentally block the new locations of moved content, while crawlers keep requesting legacy URLs that now return 404 errors, wasting crawl budget. Review robots.txt as part of every migration checklist.
XML Sitemap Reference
robots.txt and XML sitemaps work together as complementary SEO tools. While robots.txt controls crawling behavior by telling bots which areas to avoid, sitemaps proactively tell search engines about all the pages you want indexed. Most professional Indian websites should have both files properly configured, updated, and submitted to search engines.
Your sitemap should include only canonical URLs — the preferred version of each page excluding any duplicate versions created by URL parameters, tracking codes, or alternate views. If you have multiple language or regional versions of pages, use hreflang annotations to indicate the language and regional targeting. Indian websites serving both English and Hindi audiences should ensure both versions are properly declared in sitemaps with correct hreflang references pointing to each other.
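As a brief sketch with placeholder URLs, an English/Hindi page pair in a sitemap cross-references each other like this (the urlset element must also declare xmlns:xhtml="http://www.w3.org/1999/xhtml"):
<url>
  <loc>https://yourdomain.com/en/pricing/</loc>
  <xhtml:link rel="alternate" hreflang="en-IN" href="https://yourdomain.com/en/pricing/"/>
  <xhtml:link rel="alternate" hreflang="hi-IN" href="https://yourdomain.com/hi/pricing/"/>
</url>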
For large Indian e-commerce sites with thousands of products, split your sitemap into multiple files: one for products, one for categories, one for blog posts, and one for static pages like your homepage and about page. Use a sitemap index file (sitemap_index.xml) to point to all individual sitemaps. Update your robots.txt Sitemap directive to point to the sitemap index rather than individual files. This keeps your robots.txt clean and makes management easier as your site grows.
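A minimal sitemap index and the matching robots.txt directive look roughly like this; the filenames are examples, and most platforms and SEO plugins generate the index for you:
Sitemap: https://yourdomain.com/sitemap_index.xml

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://yourdomain.com/sitemap-products.xml</loc></sitemap>
  <sitemap><loc>https://yourdomain.com/sitemap-categories.xml</loc></sitemap>
  <sitemap><loc>https://yourdomain.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://yourdomain.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>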
Submit your sitemap to Google Search Console under the Sitemaps section. This tells Google exactly where to find your sitemap and triggers faster crawling of new content. After submission, check the sitemap coverage report regularly to identify any indexing issues with specific pages. Common issues include pages marked as excluded (which may indicate robots.txt blocking), pages with crawl errors, and pages with no-index tags that should be indexed.
Remember that adding a page to your sitemap does not guarantee indexing. The page must also be accessible (not blocked by robots.txt) and contain enough value to be crawled and indexed based on your site's overall authority. Sitemaps supplement your internal linking strategy rather than replacing it. Ensure your most important pages are linked from your homepage and other high-authority pages to maximize crawl priority.
Related Guide
For a complete guide to XML sitemaps including structure, submission, and troubleshooting, read our XML Sitemap Guide.
Frequently Asked Questions
How do I create a robots.txt file?
Create a plain text file named robots.txt in your website's root directory. The root directory is the top-level folder accessible on your domain (yourdomain.com/robots.txt). In cPanel, use the File Manager to navigate to public_html and create the file there. In WordPress, you can create it using an SEO plugin like Yoast or Rank Math, or manually via FTP or the File Manager. The file must be accessible at yourdomain.com/robots.txt for crawlers to read it.
Does blocking a page in robots.txt remove it from Google?
No, blocking a page in robots.txt does not remove it from search results if it is already indexed. It only prevents Googlebot from crawling the page. To remove an already-indexed page, you need to either add a noindex meta tag to the page, use the URL removal tool in Google Search Console, or ensure the page returns a proper HTTP 404 or 410 status code. Blocking in robots.txt just stops Googlebot from seeing changes to the page; it does not deindex existing entries.
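For reference, the noindex tag goes in the page's <head>, and the page must remain crawlable (not blocked in robots.txt) so Googlebot can actually see it:
<meta name="robots" content="noindex">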
What is the correct crawl delay for Indian shared hosting?
For shared hosting in India, a crawl-delay of 5 to 10 seconds is reasonable for polite bots. This slows down crawlers that respect the directive without significantly affecting how often search engines index new content. Remember that Googlebot does not respect crawl-delay, so this only affects Bing, Yahoo, and less sophisticated bots. If your shared hosting server struggles with bot traffic, consider using your hosting provider's built-in rate limiting or upgrading to a VPS with more resources.
Should I block AI training bots from crawling my site?
AI companies like OpenAI, Anthropic, and Google (for AI Overviews) increasingly crawl websites for training and content synthesis. As of 2026, there is no universal standard for blocking AI crawlers, but you can add specific user-agent rules. Commonly published AI bot user agents include GPTBot and ChatGPT-User (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google), and CCBot (Common Crawl). Add Disallow rules for these user agents if you do not want your content used for AI training, as in the sketch below. Note that this may not prevent your content from appearing in AI-generated responses through other means.
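A sketch of such rules, using user agent tokens the major vendors have published; check each vendor's documentation before relying on these, since the names change over time:
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /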
How often should I review my robots.txt file?
Review your robots.txt at least quarterly and whenever you make significant changes to your website structure. Trigger reviews when launching new sections, migrating content, adding a blog, or changing platforms. Look for outdated Disallow rules that no longer apply, newly created directories that should be blocked, and ensure your sitemap reference is current. Set a calendar reminder to audit your robots.txt alongside your XML sitemap to ensure both are in sync.