Andrew Welch · Insights · #frontend #craftcms #SEOmatic

Preventing Google from Indexing Staging Sites

SEOmatic and a multi-environment config can prevent Google from indexing your staging sites and diluting your SEO value.

N.B.: For anoth­er take on how to han­dle this, check out the Han­dling Errors Grace­ful­ly in Craft CMS article.

It’s a pret­ty com­mon work­flow pat­tern in web devel­op­ment that we work on our projects in local dev, and we push to a staging serv­er for our client to test/​approve the work, and then final­ly push to live pro­duc­tion for the pub­lic to con­sume. This is all out­lined in the Data­base & Asset Sync­ing Between Envi­ron­ments in Craft CMS arti­cle, if you’re not famil­iar with it as a workflow.

While we absolute­ly want Google (et al) to crawl and index our client’s live pro­duc­tion site, we most def­i­nite­ly do not want the staging site indexed. The rea­son is that we don’t want to dilute our SEO val­ue by hav­ing dupli­cate con­tent on both sites, and we most cer­tain­ly don’t want the staging serv­er web­site to appear as the result of any Google searches.

So how do we work around this? A solu­tion some peo­ple use is to imple­ment .htpasswd on Apache or Nginx to pass­word pro­tect the staging serv­er. This works okay — although it can be a lit­tle annoy­ing — but it has a big down­side: we can’t use any of the exter­nal per­for­mance or SEO test­ing tools as out­lined in the A Pret­ty Web­site Isn’t Enough article.

An alternative solution is to use a combination of the robots.txt file and the <meta name="robots"> tag to tell search engines to ignore our staging website.

If we use robots.txt and <meta name="robots"> to tell search engines to ignore our staging site, then we can happily use our external testing tools, without having to worry about diluting our SEO value. This article shows you how to do just that using the SEOmatic plugin.

See the Mod­ern SEO: Snake Oil vs. Sub­stance arti­cle for more detail on SEO dilu­tion, and mod­ern SEO prac­tices, if that inter­ests you.

This article assumes that you’re using a multi-environment config as described in the Multi-Environment Config for Craft CMS article. If you’re not using Craft-Multi-Environment (CME), but rather some other multi-environment setup, that’s fine. Just adapt the techniques discussed here to your particular setup.

Fight the Robots!

SEO­mat­ic makes it easy to fight off the hordes of bots that are con­stant­ly crawl­ing and index­ing your website.

If you go to the SEOmatic Site Meta settings, and scroll all the way to the bottom, you’ll see a robots.txt field.

A robots.txt file is a file at the root of your site that indi­cates those parts of your site you don’t want accessed by search engine crawlers. The file uses the Robots Exclu­sion Stan­dard, which is a pro­to­col with a small set of com­mands that can be used to indi­cate access to your site by sec­tion and by spe­cif­ic kinds of web crawlers (such as mobile crawlers vs desk­top crawlers).

SEO­mat­ic auto­mat­i­cal­ly han­dles requests for /robots.txt. For this to work, make sure that you do not have an actu­al robots.txt file in your public/ fold­er (because that will take prece­dence). If there is an actu­al robots.txt file there, just delete it.

Since the robots.txt file in SEOmatic is actually parsed as a Twig template, we can easily set up a multi-environment config that ensures that Google and other search engines will ignore our staging site:

# robots.txt for {{ siteUrl }}
Sitemap: {{ siteUrl }}sitemap.xml
{% switch craft.config.craftEnv %}
    {% case "live" %}
# Live - don't allow web crawlers to index Craft
User-agent: *
Disallow: /craft/
    {% case "staging" %}
# Staging - disallow all
User-agent: *
Disallow: /
    {% default %}
# Default - don't allow web crawlers to index Craft
User-agent: *
Disallow: /craft/
{% endswitch %}

This is just a small exam­ple of what you can do; again, since it’s a Twig tem­plate, you can real­ly put what­ev­er you want here. The key con­cept is that we use the craft.config.craftEnv vari­able (which is set by Craft-Mul­ti-Envi­ron­ment or Craft3-Mul­ti-Envi­ron­ment) to change what we out­put to the robots.txt file depend­ing on the envi­ron­ment we’re run­ning in.

If you want to see what the ren­dered robots.txt looks like, you can click on the Pre­view robots.txt but­ton on the Site Meta page, or you can just view the /robots.txt on the frontend.
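If it helps to have something to compare the preview against, here is a minimal sketch in plain PHP that mirrors the Twig switch above. The function name expectedRobotsTxt and the example.com URL are made up for illustration; this is not code from Craft or SEOmatic, and the real rendered file may differ in whitespace.

```php
<?php
// Hypothetical helper, not part of Craft or SEOmatic: it mirrors the
// Twig switch in the robots.txt template so the output per environment
// is easy to compare against what /robots.txt actually returns.
function expectedRobotsTxt(string $env, string $siteUrl): string
{
    // Header lines the template emits for every environment
    $lines = [
        "# robots.txt for {$siteUrl}",
        "Sitemap: {$siteUrl}sitemap.xml",
    ];

    if ($env === 'staging') {
        // Staging: keep every crawler out of the whole site
        $lines[] = '# Staging - disallow all';
        $lines[] = 'User-agent: *';
        $lines[] = 'Disallow: /';
    } else {
        // "live" and the default case: only hide the Craft folder
        $label = ($env === 'live') ? 'Live' : 'Default';
        $lines[] = "# {$label} - don't allow web crawlers to index Craft";
        $lines[] = 'User-agent: *';
        $lines[] = 'Disallow: /craft/';
    }

    return implode("\n", $lines) . "\n";
}

// The staging URL below is a placeholder
echo expectedRobotsTxt('staging', 'https://staging.example.com/');
```

The thing to look for on staging is the bare Disallow: / line under User-agent: *. On live, you should only see Disallow: /craft/.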

If you are using Nginx, ensure that you don’t have a rule in your .conf file that looks like this:

location = /robots.txt  { access_log off; log_not_found off; }

A directive like this will prevent SEOmatic from being able to service the request for /robots.txt. If you do have a line like this in your .conf file, just comment it out, and reload Nginx with sudo nginx -s reload.

Using the Meta Robots Tag

The oth­er way we can ensure that robots won’t index our site is to use the <meta name="robots"> tag. The robots meta tag lets us tell Google what we want it to do with our web­site on a page-by-page basis (as opposed to robots.txt, which works based on user-agent and URI).

If a page has a tag that looks like <meta name="robots" content="none"> then Google won’t index that page or follow any links on it. SEOmatic has you covered here, too, because it comes with a multi-environment friendly config.php that lets you override any of its settings on a per-environment basis.

So for instance, we can tell it that regard­less of any oth­er set­tings, if the envi­ron­ment is a stag­ing envi­ron­ment, always out­put the meta robots tag as <meta name="robots" content="none"> to pre­vent index­ing and fol­low­ing of links on that page.

All you need to do is copy this exam­ple, and save it to your craft/config/ direc­to­ry as seomatic.php and SEO­mat­ic will uti­lize the settings:

<?php

/**
 * SEOmatic Configuration
 *
 * Completely optional configuration settings for SEOmatic if you want to customize some
 * of its more esoteric behavior, or just want specific control over things.
 *
 * Don't edit this file, instead copy it to 'craft/config' as 'seomatic.php' and make
 * your changes there.
 */

return array(
    // All environments
    '*' => array(
    /**
     * The maximum number of characters allowed for the seoTitle.  It's HIGHLY recommended that
     * you keep this set to 70 characters.
     */
        "maxTitleLength" => 70,

    /**
     * Controls whether SEOmatic will truncate the text in <title> tags to maxTitleLength characters.
     * It is HIGHLY recommended that you leave this on, as search engines do not want
     * <title> tags to be long, and long titles won't display well on mobile either.
     */
        "truncateTitleTags" => true,

    /**
     * The maximum number of characters allowed for the seoDescription.  It's HIGHLY recommended that
     * you keep this set to 160 characters.
     */
        "maxDescriptionLength" => 160,

    /**
     * Controls whether SEOmatic will truncate the description tags to maxDescriptionLength characters.
     * It is HIGHLY recommended that you leave this on, as search engines do not want
     * description tags to be long.
     */
        "truncateDescriptionTags" => true,

    /**
     * The maximum number of characters allowed for the seoKeywords.  It's HIGHLY recommended that
     * you keep this set to 200 characters.
     */
        "maxKeywordsLength" => 200,

    /**
     * Controls whether SEOmatic will truncate the keywords tags to maxKeywordsLength characters.
     * It is HIGHLY recommended that you leave this on, as search engines do not want
     * keywords tags to be long.
     */
        "truncateKeywordsTags" => true,

    /**
     * SEOmatic will render the Google Analytics <script> tag and code for you, if you
     * enter a Google Analytics UID tracking code in the Site Identity settings.  It
     * does not render the <script> tag if devMode is on or during Live Preview, but
     * here is an additional override for controlling it.
     */
        "renderGoogleAnalyticsScript" => true,

    /**
     * SEOmatic will render the Google Tag Manager <script> tag and code for you, if you
     * enter a Google Tag Manager ID tracking code in the Site Identity settings.  It
     * does not render the <script> tag during Live Preview, but here is an additional
     * override for controlling it.  It does render the script tag if devMode is on,
     * to allow for debugging GTM.
     */
        "renderGoogleTagManagerScript" => true,

    /**
     * This controls the name of the Javascript variable that SEOmatic outputs for the
     * dataLayer variable.  Note that the Twig variable always will be named:
     * `dataLayer` regardless of this setting.
     */
        "gtmDataLayerVariableName" => "dataLayer",

    /**
     * SEOmatic will render Product JSON-LD microdata for you automatically, if an SEOmatic Meta
     * FieldType is attached to a Craft Commerce Product.  Set this to false to override
     * this behavior, and not render the Product JSON-LD microdata.
     */
        "renderCommerceProductJSONLD" => true,

    /**
     * SEOmatic uses the `siteUrl` to generate the external URLs.  If you are using it in
     * a non-standard environment, such as a headless ElementAPI server, you can override
     * what it uses for the `siteUrl` below.
     */
        "siteUrlOverride" => '',

    /**
     * Controls whether SEOmatic will display the SEOmetrics information during Live Preview.
     */
        "displaySeoMetrics" => true,

    /**
     * Determines the name used for the "Home" default breadcrumb.
     */
        "breadcrumbsHomeName" => 'Home',

    /**
     * Determines the string prepended to the <title> tag when devMode is on.
     */
        "siteDevModeTitle" => '[devMode]',

    /**
     * This allows you to globally override the meta settings on your website.  WARNING:
     * anything you set here will REPLACE the meta settings globally.  You might wish to
     * use this, for instance, to set 'robots' to be 'none' on development/staging to
     * prevent crawlers from indexing it.  Since this config file is multi-environment aware,
     * like any Craft config file, this allows you to do just that.
     * Leave any value in the array blank to cause it to not override.
     */

        "globalMetaOverride" => array(
            'locale'                    => '',
            'seoMainEntityCategory'     => '',
            'seoMainEntityOfPage'       => '',
            'seoTitle'                  => '',
            'seoDescription'            => '',
            'seoKeywords'               => '',
            'seoImageTransform'         => '',
            'seoFacebookImageTransform' => '',
            'seoTwitterImageTransform'  => '',
            'twitterCardType'           => '',
            'openGraphType'             => '',
            'robots'                    => '',
            'seoImageId'                => '',
        ),
    ),
    // Live (production) environment
    'live' => array(
    ),

    // Staging (pre-production) environment
    'staging' => array(
    /**
     * This allows you to globally override the meta settings on your website.  WARNING:
     * anything you set here will REPLACE the meta settings globally.  You might wish to
     * use this, for instance, to set 'robots' to be 'none' on development/staging to
     * prevent crawlers from indexing it.  Since this config file is multi-environment aware,
     * like any Craft config file, this allows you to do just that.
     * Leave any value in the array blank to cause it to not override.
     */

        "globalMetaOverride" => array(
            'locale'                    => '',
            'seoMainEntityCategory'     => '',
            'seoMainEntityOfPage'       => '',
            'seoTitle'                  => '',
            'seoDescription'            => '',
            'seoKeywords'               => '',
            'seoImageTransform'         => '',
            'seoFacebookImageTransform' => '',
            'seoTwitterImageTransform'  => '',
            'twitterCardType'           => '',
            'openGraphType'             => '',
            'robots'                    => 'none',
            'seoImageId'                => '',
        ),
    ),
    // Local (development) environment
    'local' => array(
    /**
     * This allows you to globally override the meta settings on your website.  WARNING:
     * anything you set here will REPLACE the meta settings globally.  You might wish to
     * use this, for instance, to set 'robots' to be 'none' on development/staging to
     * prevent crawlers from indexing it.  Since this config file is multi-environment aware,
     * like any Craft config file, this allows you to do just that.
     * Leave any value in the array blank to cause it to not override.
     */

        "globalMetaOverride" => array(
            'locale'                    => '',
            'seoMainEntityCategory'     => '',
            'seoMainEntityOfPage'       => '',
            'seoTitle'                  => '',
            'seoDescription'            => '',
            'seoKeywords'               => '',
            'seoImageTransform'         => '',
            'seoFacebookImageTransform' => '',
            'seoTwitterImageTransform'  => '',
            'twitterCardType'           => '',
            'openGraphType'             => '',
            'robots'                    => 'none',
            'seoImageId'                => '',
        ),
    ),
);

As you can see by look­ing at the file, there are a ton of oth­er things you can con­trol in SEO­mat­ic by using this file as well, all on a per-envi­ron­ment basis.
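The "leave any value blank to cause it to not override" rule from the comments in the config can be pictured with a few lines of plain PHP. This is only a sketch of the behavior as the comments describe it, not SEOmatic's actual source; the function name applyGlobalMetaOverride and the sample page values are invented for the example.

```php
<?php
// Illustrative only: a plain-PHP picture of the "blank means don't
// override" rule. This is an assumption based on the config comments,
// not SEOmatic's actual source code.
function applyGlobalMetaOverride(array $meta, array $override): array
{
    // Drop blank entries so they leave the existing value alone
    $active = array_filter($override, function ($value) {
        return $value !== '';
    });

    // Whatever is left replaces the page's own setting
    return array_merge($meta, $active);
}

// Hypothetical per-page meta values
$pageMeta = ['seoTitle' => 'About Us', 'robots' => 'all'];

// A staging-style override: only 'robots' is filled in
$stagingOverride = ['seoTitle' => '', 'robots' => 'none'];

print_r(applyGlobalMetaOverride($pageMeta, $stagingOverride));
```

With these sample values, the page keeps its own seoTitle while robots is forced to none, which is the effect we want on staging and local.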

There’s certainly no harm in using both robots.txt and <meta name="robots"> at the same time, just to be doubly sure. The configs listed above, incidentally, are used verbatim on this very website that you’re reading.

Meat over Metal

The tru­ly use­ful part about doing a set­up this way is that you can just set it and for­get it. I can’t tell you how many sites I’ve seen where the devel­op­er has set the staging site to not be indexed (by one tech­nique or anoth­er), and then for­got to change it back when deploy­ing the site to live production.

Ooops.

That’s a big “ooops” because it means the content on the live production site isn’t being indexed by Google or other search engines.

Using a set­up like this also ensures that you don’t acci­den­tal­ly for­get to set the staging serv­er to not be indexed. Once Google has con­sumed your con­tent, it takes quite a bit of doing (and time) to make it forget.
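If you'd like a quick sanity check after a deploy, something along these lines could be pointed at the text of /robots.txt and the HTML of a page from each environment. It's a rough sketch in plain PHP: the function names are hypothetical, the robots.txt parsing is deliberately simple (it ignores Allow rules that might re-open parts of the site), and the meta tag pattern assumes the name attribute comes before content.

```php
<?php
// Hypothetical helpers for a post-deploy sanity check; the function names
// are illustrative and not part of Craft, SEOmatic, or any library.

// True if the robots.txt text has "Disallow: /" in a group that applies
// to all crawlers ("User-agent: *").
function robotsTxtDisallowsAll(string $robotsTxt): bool
{
    $appliesToAll = false;
    $inRules = false; // true once the current group has listed a rule
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        // Strip comments and surrounding whitespace
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            if ($inRules) {
                // A User-agent line after rules starts a new group
                $appliesToAll = false;
                $inRules = false;
            }
            if ($value === '*') {
                $appliesToAll = true;
            }
        } elseif ($field === 'disallow' || $field === 'allow') {
            $inRules = true;
            if ($field === 'disallow' && $value === '/' && $appliesToAll) {
                return true;
            }
        }
    }
    return false;
}

// Returns the content of the first <meta name="robots"> tag, or null.
function metaRobotsContent(string $html): ?string
{
    $pattern = '/<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*)["\']/i';
    if (preg_match($pattern, $html, $matches)) {
        return strtolower(trim($matches[1]));
    }
    return null;
}

// True if the meta robots value tells crawlers not to index the page.
function metaRobotsBlocksIndexing(?string $content): bool
{
    if ($content === null) {
        return false;
    }
    $directives = array_map('trim', explode(',', $content));
    return in_array('none', $directives, true)
        || in_array('noindex', $directives, true);
}

// Sample staging-style inputs; in practice these would be fetched
// from the environment being checked.
$robots = "User-agent: *\nDisallow: /\n";
$html = '<head><meta name="robots" content="none"></head>';
var_dump(robotsTxtDisallowsAll($robots));
var_dump(metaRobotsBlocksIndexing(metaRobotsContent($html)));
```

On staging you want both checks to come back true, and on live production you want both to come back false. If live ever reports true, that's the “ooops” described above.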

So go forth, and become a lev­el 9 bot herder!