Andrew Welch
Preventing Google from Indexing Staging Sites
SEOmatic and a multi-environment config can prevent Google from indexing your staging sites and diluting your SEO value
N.B.: For another take on how to handle this, check out the Handling Errors Gracefully in Craft CMS article.
It’s a pretty common workflow pattern in web development that we work on our projects in local dev, and we push to a staging server for our client to test/approve the work, and then finally push to live production for the public to consume. This is all outlined in the Database & Asset Syncing Between Environments in Craft CMS article, if you’re not familiar with it as a workflow.
While we absolutely want Google (et al) to crawl and index our client’s live production site, we most definitely do not want the staging site indexed. The reason is that we don’t want to dilute our SEO value by having duplicate content on both sites, and we most certainly don’t want the staging server website to appear as the result of any Google searches.
So how do we work around this? A solution some people use is to implement .htpasswd on Apache or Nginx to password protect the staging server. This works okay — although it can be a little annoying — but it has a big downside: we can’t use any of the external performance or SEO testing tools as outlined in the A Pretty Website Isn’t Enough article.
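For reference, the password-protection approach on Nginx usually boils down to a couple of auth_basic directives. Here's a minimal sketch, in which the staging.example.com domain, the web root, and the htpasswd file location are all assumptions for illustration:
# Hypothetical Nginx server block that password protects a staging site
# Create the credentials file first, e.g.: htpasswd -c /etc/nginx/.htpasswd someuser
server {
    listen 80;
    server_name staging.example.com;
    root /var/www/staging/public;
    # Prompt for a username/password on every request to this server block
    auth_basic "Staging";
    auth_basic_user_file /etc/nginx/.htpasswd;
}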
An alternative solution is to use a combination of the robots.txt file and the <meta name="robots"> tag to tell search engines to ignore our staging website.
If we use robots.txt and <meta name="robots"> to tell search engines to ignore our staging site, then we can happily use our external testing tools without having to worry about diluting our SEO value. This article shows you how to do just that using the SEOmatic plugin.
See the Modern SEO: Snake Oil vs. Substance article for more detail on SEO dilution, and modern SEO practices, if that interests you.
This article assumes that you’re using a multi-environment config as described in the Multi-Environment Config for Craft CMS article. If you’re not using Craft-Multi-Environment (CME), but rather some other multi-environment setup, that’s fine. Just adapt the techniques discussed here to your particular setup.
Fight the Robots!
SEOmatic makes it easy to fight off the hordes of bots that are constantly crawling and indexing your website.
If you go to the SEOmatic→Site Meta settings, and scroll all the way to the bottom, you’ll see a robots.txt field.
A robots.txt file is a file at the root of your site that indicates those parts of your site you don’t want accessed by search engine crawlers. The file uses the Robots Exclusion Standard, which is a protocol with a small set of commands that can be used to indicate access to your site by section and by specific kinds of web crawlers (such as mobile crawlers vs desktop crawlers).
SEOmatic automatically handles requests for /robots.txt. For this to work, make sure that you do not have an actual robots.txt file in your public/ folder (because that will take precedence). If there is an actual robots.txt file there, just delete it.
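If you’re not sure whether a leftover file is shadowing SEOmatic’s route, a quick check from your project root might look like this (assuming your web root is public/, as in the multi-environment setup mentioned above):
# A physical robots.txt will be served instead of SEOmatic's dynamic one
ls -l public/robots.txt
# If it exists, remove it so the request falls through to Craft and SEOmatic
rm public/robots.txt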
Since the robots.txt in SEOmatic is actually parsed as a Twig template, we can easily set up a multi-environment config that ensures Google and other search engines will ignore our staging site:
# robots.txt for {{ siteUrl }}
Sitemap: {{ siteUrl }}sitemap.xml
{% switch craft.config.craftEnv %}
{% case "live" %}
# Live - don't allow web crawlers to index Craft
User-agent: *
Disallow: /craft/
{% case "staging" %}
# Staging - disallow all
User-agent: *
Disallow: /
{% default %}
# Default - don't allow web crawlers to index Craft
User-agent: *
Disallow: /craft/
{% endswitch %}
This is just a small example of what you can do; again, since it’s a Twig template, you can really put whatever you want here. The key concept is that we use the craft.config.craftEnv variable (which is set by Craft-Multi-Environment or Craft3-Multi-Environment) to change what we output to the robots.txt file depending on the environment we’re running in.
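If you’re not using CME but want the same behavior, craftEnv is just a custom setting in craft/config/general.php. A minimal sketch along these lines would expose craft.config.craftEnv to the template above; it assumes a CRAFT_ENVIRONMENT constant is defined in your public/index.php, which is how CME does it:
<?php
// craft/config/general.php -- adapt the environment names to your own setup
return array(
    // All environments
    '*' => array(
        // Custom setting; readable in Twig as craft.config.craftEnv
        'craftEnv' => CRAFT_ENVIRONMENT,
    ),
    // Live (production) environment
    'live' => array(
        'devMode' => false,
    ),
    // Staging (pre-production) environment
    'staging' => array(
        'devMode' => false,
    ),
    // Local (development) environment
    'local' => array(
        'devMode' => true,
    ),
);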
If you want to see what the rendered robots.txt looks like, you can click on the Preview robots.txt button on the Site Meta page, or you can just view the /robots.txt on the frontend.
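You can also sanity-check it from the command line. With your real domains swapped in for these placeholder ones, the staging and production output should diverge:
# Staging should come back with a blanket "Disallow: /"
curl https://staging.example.com/robots.txt
# Production should only disallow /craft/
curl https://example.com/robots.txt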
If you are using Nginx, ensure that you don’t have a rule in your .conf file that looks like this:
location = /robots.txt { access_log off; log_not_found off; }
A directive like this will prevent SEOmatic from being able to service the request for /robots.txt. If you do have a line like this in your .conf file, just comment it out, and restart Nginx with sudo nginx -s reload.
Using the Meta Robots Tag
The other way we can ensure that robots won’t index our site is to use the <meta name="robots"> tag. The robots meta tag lets us tell Google what we want it to do with our website on a page-by-page basis (as opposed to robots.txt, which works based on user-agent and URI).
If a page has a tag that looks like <meta name="robots" content="none">, then search engines won’t index that page or follow any links on it. SEOmatic has you covered here, too, because it comes with a multi-environment friendly config.php that lets you override any of its settings on a per-environment basis.
So for instance, we can tell it that regardless of any other settings, if the environment is a staging environment, always output the meta robots tag as <meta name="robots" content="none"> to prevent indexing and following of links on that page.
All you need to do is copy this example and save it to your craft/config/ directory as seomatic.php, and SEOmatic will use those settings:
<?php
/**
 * SEOmatic Configuration
 *
 * Completely optional configuration settings for SEOmatic if you want to customize some
 * of its more esoteric behavior, or just want specific control over things.
 *
 * Don't edit this file, instead copy it to 'craft/config' as 'seomatic.php' and make
 * your changes there.
 */
return array(
    // All environments
    '*' => array(
        /**
         * The maximum number of characters allowed for the seoTitle. It's HIGHLY recommended that
         * you keep this set to 70 characters.
         */
        "maxTitleLength" => 70,
        /**
         * Controls whether SEOmatic will truncate the text in <title> tags to maxTitleLength characters.
         * It is HIGHLY recommended that you leave this on, as search engines do not want
         * <title> tags to be long, and long titles won't display well on mobile either.
         */
        "truncateTitleTags" => true,
        /**
         * The maximum number of characters allowed for the seoDescription. It's HIGHLY recommended that
         * you keep this set to 160 characters.
         */
        "maxDescriptionLength" => 160,
        /**
         * Controls whether SEOmatic will truncate the description tags to maxDescriptionLength characters.
         * It is HIGHLY recommended that you leave this on, as search engines do not want
         * description tags to be long.
         */
        "truncateDescriptionTags" => true,
        /**
         * The maximum number of characters allowed for the seoKeywords. It's HIGHLY recommended that
         * you keep this set to 200 characters.
         */
        "maxKeywordsLength" => 200,
        /**
         * Controls whether SEOmatic will truncate the keywords tags to maxKeywordsLength characters.
         * It is HIGHLY recommended that you leave this on, as search engines do not want
         * keywords tags to be long.
         */
        "truncateKeywordsTags" => true,
        /**
         * SEOmatic will render the Google Analytics <script> tag and code for you, if you
         * enter a Google Analytics UID tracking code in the Site Identity settings. It
         * does not render the <script> tag if devMode is on or during Live Preview, but
         * here is an additional override for controlling it.
         */
        "renderGoogleAnalyticsScript" => true,
        /**
         * SEOmatic will render the Google Tag Manager <script> tag and code for you, if you
         * enter a Google Tag Manager ID tracking code in the Site Identity settings. It
         * does not render the <script> tag during Live Preview, but here is an additional
         * override for controlling it. It does render the script tag if devMode is on,
         * to allow for debugging GTM.
         */
        "renderGoogleTagManagerScript" => true,
        /**
         * This controls the name of the Javascript variable that SEOmatic outputs for the
         * dataLayer variable. Note that the Twig variable will always be named
         * `dataLayer` regardless of this setting.
         */
        "gtmDataLayerVariableName" => "dataLayer",
        /**
         * SEOmatic will render Product JSON-LD microdata for you automatically, if an SEOmatic Meta
         * FieldType is attached to a Craft Commerce Product. Set this to false to override
         * this behavior, and not render the Product JSON-LD microdata.
         */
        "renderCommerceProductJSONLD" => true,
        /**
         * SEOmatic uses the `siteUrl` to generate the external URLs. If you are using it in
         * a non-standard environment, such as a headless ElementAPI server, you can override
         * what it uses for the `siteUrl` below.
         */
        "siteUrlOverride" => '',
        /**
         * Controls whether SEOmatic will display the SEOmetrics information during Live Preview.
         */
        "displaySeoMetrics" => true,
        /**
         * Determines the name used for the "Home" default breadcrumb.
         */
        "breadcrumbsHomeName" => 'Home',
        /**
         * Determines the string prepended to the <title> tag when devMode is on.
         */
        "siteDevModeTitle" => '[devMode]',
        /**
         * This allows you to globally override the meta settings on your website. WARNING:
         * anything you set here will REPLACE the meta settings globally. You might wish to
         * use this, for instance, to set 'robots' to be 'none' on development/staging to
         * prevent crawlers from indexing it. Since this config file is multi-environment aware,
         * like any Craft config file, this allows you to do just that.
         * Leave any value in the array blank to cause it to not override.
         */
        "globalMetaOverride" => array(
            'locale' => '',
            'seoMainEntityCategory' => '',
            'seoMainEntityOfPage' => '',
            'seoTitle' => '',
            'seoDescription' => '',
            'seoKeywords' => '',
            'seoImageTransform' => '',
            'seoFacebookImageTransform' => '',
            'seoTwitterImageTransform' => '',
            'twitterCardType' => '',
            'openGraphType' => '',
            'robots' => '',
            'seoImageId' => '',
        ),
    ),
    // Live (production) environment
    'live' => array(
    ),
    // Staging (pre-production) environment
    'staging' => array(
        /**
         * This allows you to globally override the meta settings on your website. WARNING:
         * anything you set here will REPLACE the meta settings globally. You might wish to
         * use this, for instance, to set 'robots' to be 'none' on development/staging to
         * prevent crawlers from indexing it. Since this config file is multi-environment aware,
         * like any Craft config file, this allows you to do just that.
         * Leave any value in the array blank to cause it to not override.
         */
        "globalMetaOverride" => array(
            'locale' => '',
            'seoMainEntityCategory' => '',
            'seoMainEntityOfPage' => '',
            'seoTitle' => '',
            'seoDescription' => '',
            'seoKeywords' => '',
            'seoImageTransform' => '',
            'seoFacebookImageTransform' => '',
            'seoTwitterImageTransform' => '',
            'twitterCardType' => '',
            'openGraphType' => '',
            'robots' => 'none',
            'seoImageId' => '',
        ),
    ),
    // Local (development) environment
    'local' => array(
        /**
         * This allows you to globally override the meta settings on your website. WARNING:
         * anything you set here will REPLACE the meta settings globally. You might wish to
         * use this, for instance, to set 'robots' to be 'none' on development/staging to
         * prevent crawlers from indexing it. Since this config file is multi-environment aware,
         * like any Craft config file, this allows you to do just that.
         * Leave any value in the array blank to cause it to not override.
         */
        "globalMetaOverride" => array(
            'locale' => '',
            'seoMainEntityCategory' => '',
            'seoMainEntityOfPage' => '',
            'seoTitle' => '',
            'seoDescription' => '',
            'seoKeywords' => '',
            'seoImageTransform' => '',
            'seoFacebookImageTransform' => '',
            'seoTwitterImageTransform' => '',
            'twitterCardType' => '',
            'openGraphType' => '',
            'robots' => 'none',
            'seoImageId' => '',
        ),
    ),
);
As you can see by looking at the file, there are a ton of other things you can control in SEOmatic by using this file as well, all on a per-environment basis.
There’s certainly no harm in using both robots.txt and <meta name="robots"> at the same time, just to be doubly sure. The configs listed above, incidentally, are used verbatim on this very website that you’re reading.
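A quick way to confirm both layers are actually in place after a deploy is to spot-check the rendered markup from each environment. Again, this is just a sketch with placeholder domains; substitute your own:
# Staging should render <meta name="robots" content="none">
curl -s https://staging.example.com/ | grep -i 'meta name="robots"'
# Production should not show content="none" here
curl -s https://example.com/ | grep -i 'meta name="robots"'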
Meat over Metal
The truly useful part about doing a setup this way is that you can just set it and forget it. I can’t tell you how many sites I’ve seen where the developer has set the staging site to not be indexed (by one technique or another), and then forgot to change it back when deploying the site to live production.
Ooops.
That’s a big “ooops” because it means the content on the live production site isn’t being indexed by Google or other search engines.
Using a setup like this also ensures that you don’t accidentally forget to set the staging server to not be indexed. Once Google has consumed your content, it takes quite a bit of doing (and time) to make it forget.
So go forth, and become a level 9 bot herder!