12/2022

Elm Robots and Humans

Humans build websites using a wide range of tools and technologies. Elm is a functional programming language that can build reliable websites. Robots crawl websites and gather useful information for search engines.

In this note we will try to understand and build the contents of a robots.txt file and a humans.txt file using Elm.

elm-robots-humans.jpg

Humans are odd. They think order and chaos are somehow opposites and try to control what won't be. But there is grace in their failings.

-- Vision, Avengers: Age of Ultron

Robots.txt

Robots.txt is a public file on our website where we can define policies, sitemaps and the host name, to let crawlers know where, when and what they can access and index in search engines.

A robots.txt file can look something like this:

User-agent: *
Allow: *
Sitemap: https://marcodaniels.com/sitemap.xml
Host: https://marcodaniels.com

This information allows all robots (User-agent: *) to access all pages (Allow: *) on this website. It also points to where the sitemap can be found.

Policies

In our robots.txt we can define multiple policies for multiple user-agents.

In this example we disallow the page /search just for Googlebot, while for Bingbot we disallow all pages on our website. At the end we make sure to allow all other crawlers on all our pages:

User-agent: Googlebot
Disallow: /search
User-agent: Bingbot
Disallow: /
User-agent: *
Allow: *

Crawl-Delay

Crawl-delay tells crawlers to wait a specific amount of time (in seconds) between requests before crawling more of our pages.

User-agent: Googlebot
Crawl-delay: 120
Disallow: /search

This directive can be used per user-agent.

Clean-Param

Clean-param tells crawlers to remove parameter(s) from the URL's query string.

User-agent: *
Allow: *
Clean-param: id /user

This will remove the id parameter from the URL .../user?id=1234.

Humans.txt

Humans.txt is a fun initiative where we can introduce the people behind our website, list the technologies and standards we follow, and provide acknowledgements and greetings for the humans who built the website.

/* Team */
Engineer: Marco Martins
/* Technology */
Elm, Terraform, Nix

The humans.txt file does not have a strictly defined structure, and it can take many formats and contain different information, as it is a file from humans to humans.

elm-robots-humans

elm-robots-humans is an Elm package that allows us to write robots.txt and humans.txt file contents in a structured and typed manner.

The Robots module exposes multiple functions and types that allow us to write policies per user-agent with all the needed directives.

robots : String
robots =
    Robots.robots
        { sitemap = Robots.SingleValue "/sitemap.xml"
        , host = "https://marcodaniels.com"
        , policies =
            [ Robots.policy
                { userAgent = Robots.SingleValue "*"
                , allow = Just (Robots.SingleValue "*")
                , disallow = Nothing
                }
                |> Robots.withCrawlDelay 120
            ]
        }

Because some properties (sitemap, userAgent, allow, disallow) can hold either a single value or multiple values, we use a Value custom type that allows us to be more expressive about our policy needs. The example above will generate the following string:

User-agent: *
Allow: *
Crawl-delay: 120
Sitemap: /sitemap.xml
Host: https://marcodaniels.com
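
To tie this back to the Policies section, here is a sketch of how the Googlebot/Bingbot example could be expressed with the same Robots API. This is my own illustrative translation (the name multiPolicyRobots is made up for the example); it only uses the functions and fields already shown above:

-- Sketch: the Googlebot/Bingbot policies from the Policies section,
-- written with the same Robots API used above.
multiPolicyRobots : String
multiPolicyRobots =
    Robots.robots
        { sitemap = Robots.SingleValue "/sitemap.xml"
        , host = "https://marcodaniels.com"
        , policies =
            [ Robots.policy
                { userAgent = Robots.SingleValue "Googlebot"
                , allow = Nothing
                , disallow = Just (Robots.SingleValue "/search")
                }
            , Robots.policy
                { userAgent = Robots.SingleValue "Bingbot"
                , allow = Nothing
                , disallow = Just (Robots.SingleValue "/")
                }
            , Robots.policy
                { userAgent = Robots.SingleValue "*"
                , allow = Just (Robots.SingleValue "*")
                , disallow = Nothing
                }
            ]
        }

The order of the list maps directly to the order of the policy blocks in the generated robots.txt content.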

For the humans.txt side, the Humans module allows us to write each "section" with just a headline and content:

humans : String
humans =
    Humans.humans
        [ { headline = "Team"
          , content = [ "Engineer: Marco Martins" ]
          }
        , { headline = "Technology"
          , content = [ "Elm, Terraform, Nix" ]
          }
        ]

Since the humans.txt content does not require much structure, this gives us an easy way to create the humans content. The example above would generate the following string:

/* Team */
Engineer: Marco Martins
/* Technology */
Elm, Terraform, Nix

You're still here?

Thank you so much for reading this!

Go ahead and check out the elm-robots-humans package and its source code on GitHub.

You can also see it in action on this website at /robots.txt, and the respective source code on GitHub.
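
If you are curious how the generated strings can end up as actual files, here is one possible sketch (not necessarily how this website does it): a small port module whose worker sends the robots and humans values defined above to a JavaScript build script that writes robots.txt and humans.txt to disk. The module name, the writeFile port and its JS counterpart are assumptions for illustration only:

port module Generate exposing (main)

-- Sketch only: assumes the `robots` and `humans` definitions from earlier
-- in this note live in this module; the `writeFile` port and the JS side
-- that actually writes the files are assumptions, not part of the package.

import Humans
import Robots


port writeFile : { name : String, content : String } -> Cmd msg


main : Program () () Never
main =
    Platform.worker
        { init =
            \_ ->
                ( ()
                , Cmd.batch
                    [ writeFile { name = "robots.txt", content = robots }
                    , writeFile { name = "humans.txt", content = humans }
                    ]
                )
        , update = \_ model -> ( model, Cmd.none )
        , subscriptions = \_ -> Sub.none
        }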

Until then!