Crawling is the very first stage of a search engine discovering your website's content. Without crawling, your pages can't be indexed or ranked, and, most importantly, they won't receive any traffic.
The process itself is remarkably complex and completely automated. Back in 2016, Google announced that it knew of 130 trillion pages on the web.
However, research from Ahrefs also shows that 96.55% of content gets zero traffic from Google. This demonstrates that search engines like Google only crawl and index a fraction of the content out there.
In this article, we’ll break down the stages of search engine crawling so you have a better understanding of how search engines crawl websites.
What is Search Engine Crawling?
Google defines web crawling as the process of discovering and downloading text, images, and videos from web pages found online, using bots called search engine crawlers. Crawlers are often referred to as spiders.
These crawlers find new pages by travelling through URLs (links). They crawl sitemaps, internal links, and backlinks to find additional pages that haven't been crawled. Once they find a new page, they extract its information to index it in their database.
Different Search Engine Bots & User Agent Strings
It’s good to know the different search engine bots and user agent strings to better understand search engine crawling.
Bots are crawlers. These are automated programs whose sole job is to discover new web pages online. User-agent strings are unique identifiers that these bots use to announce themselves when they request access to a website’s server.
This is important because if you block a bot's access, your pages won't be crawled, and therefore won't rank, on that particular search engine.
| Search Engine | Bot Name | User Agent String Example | Purpose |
| --- | --- | --- | --- |
| Google | Googlebot | Googlebot/2.1 (+http://www.google.com/bot.html) | Crawls and indexes web pages for Google Search and other Google services. |
| Bing | Bingbot | Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) | Crawls and indexes web pages for Bing Search. |
| DuckDuckGo | DuckDuckBot | DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html) | Crawls and indexes web pages for DuckDuckGo Search. |
| Yandex | YandexBot | Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) | Crawls and indexes web pages for Yandex Search. |
| Baidu | Baiduspider | Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) | Crawls and indexes web pages for Baidu Search. |
| Apple | Applebot | Applebot/1.0 (+http://www.apple.com/go/applebot) | Crawls and indexes web pages for Apple services like Spotlight Search. |
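If you're unsure whether your robots.txt file accidentally blocks one of these bots, here's a minimal Python sketch using the standard library's urllib.robotparser to check which of the user agents above are allowed to fetch a given URL. The domain and page path are placeholders.

```python
# Minimal sketch: check whether each crawler's user agent token is allowed
# to fetch a URL under your robots.txt rules. Domain and path are placeholders.
from urllib.robotparser import RobotFileParser

ROBOTS_URL = "https://www.example.com/robots.txt"  # placeholder domain
PAGE_URL = "https://www.example.com/blog/my-post/"  # placeholder page

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetches and parses the live robots.txt file

# User agent tokens from the table above
bots = ["Googlebot", "Bingbot", "DuckDuckBot", "YandexBot", "Baiduspider", "Applebot"]

for bot in bots:
    allowed = parser.can_fetch(bot, PAGE_URL)
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```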
The Stages of Search Engine Crawling
Before serving search results, a search engine must undergo a complex crawling process. The process can be broken down into five stages. These include:
Stage 1: Discovery
The crawling process starts with URL discovery. Search engines, like Google, use bots (in this case, Googlebot) to find new pages they haven’t crawled.
Typically, search engines have a huge queue of URLs waiting to be crawled. This is why optimising crawlability is important: it helps ensure your pages are crawled promptly rather than left sitting in the queue.
Search engines discover new content in several ways, including:
- Crawling previously crawled URLs for updates or changes.
- Crawling a website’s XML sitemap to find new pages.
- Crawling internal and external links to find new pages.
Remember, however, “Crawling is not a guarantee you’re indexed.” – Rand Fishkin
Just because your web page is discovered doesn't mean it will be indexed. Search engines apply a whole set of requirements before they index your web pages.
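Before moving on to fetching, here's a deliberately simplified sketch of the discovery queue described above (often called a crawl frontier). The class and URLs are purely illustrative; real search engines use far more sophisticated scheduling.

```python
# A simplified crawl frontier: newly discovered URLs wait in a queue until
# the crawler gets to them. Duplicate URLs are skipped.
from collections import deque

class CrawlFrontier:
    def __init__(self, seed_urls):
        self.queue = deque(seed_urls)   # URLs waiting to be crawled
        self.seen = set(seed_urls)      # avoid queueing the same URL twice

    def add(self, url):
        """Add a newly discovered URL (from a sitemap, internal link, or backlink)."""
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        """Return the next URL to crawl, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None

# Usage: seed with a homepage, then add links as they are discovered.
frontier = CrawlFrontier(["https://www.example.com/"])
frontier.add("https://www.example.com/blog/")
print(frontier.next_url())  # https://www.example.com/
```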
Stage 2: Fetching
Once a URL is selected from the queue, the search engine crawler sends a request to the web server hosting the page.
The server then responds by sending the page's content, in most cases in HTML format. This HTML code contains the structure and text of the page. It may also include links to other resources, such as images, CSS stylesheets, and JavaScript files.
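As a rough illustration of the fetching step, the sketch below requests a page and reads back its HTML using Python's standard library. The URL and user agent string are placeholders, not a real crawler's identity.

```python
# Minimal sketch of fetching: request a URL and read back the HTML.
import urllib.request

url = "https://www.example.com/"  # placeholder
request = urllib.request.Request(
    url,
    headers={"User-Agent": "ExampleCrawler/1.0 (+https://www.example.com/bot.html)"},
)

with urllib.request.urlopen(request, timeout=10) as response:
    status = response.status  # 200 on success (urlopen raises HTTPError for 4xx/5xx)
    html = response.read().decode("utf-8", errors="replace")

print(f"Status {status}: fetched {len(html)} characters of HTML")
```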
Stage 3: Parsing
After fetching the page, a crawler parses (aka analyses) the HTML content to extract information.
The type of information it extracts includes the following (a small parsing sketch follows this list):
- Links: The crawler identifies all the links within the HTML code, such as internal links and backlinks. These are then added to the discovery queue for future crawling.
- Resources: The crawler also extracts references to other resources embedded in the HTML, such as images, CSS stylesheets, and JavaScript files. These are then fetched and analysed separately.
- Metadata: The crawler also extracts metadata, such as a page's title, description, and keywords, which is used to understand the page's content and relevance.
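Here's the parsing sketch mentioned above, using Python's built-in HTMLParser to pull out links, resource references, and basic metadata. It's a simplified illustration, not how a production crawler parses pages.

```python
# Simplified parsing: extract links, resources, and metadata from fetched HTML.
from html.parser import HTMLParser

class PageParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links, self.resources, self.metadata = [], [], {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])                 # internal links and outbound links
        elif tag in ("img", "script") and "src" in attrs:
            self.resources.append(attrs["src"])              # images and JavaScript files
        elif tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
            self.resources.append(attrs["href"])             # CSS stylesheets
        elif tag == "meta" and "name" in attrs and "content" in attrs:
            self.metadata[attrs["name"]] = attrs["content"]  # description, keywords, etc.
        elif tag == "title":
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = data.strip()
            self._in_title = False

parser = PageParser()
parser.feed("<html><head><title>Example</title><meta name='description' content='A demo'>"
            "</head><body><a href='/about/'>About</a><img src='/logo.png'></body></html>")
print(parser.links, parser.resources, parser.metadata)
# ['/about/'] ['/logo.png'] {'title': 'Example', 'description': 'A demo'}
```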
Stage 4: Rendering
Nowadays, many pages rely on JavaScript to build their content, which means crawlers can't always understand a page just by reading the raw HTML.
To fully understand the content of a webpage, search engines need to go through the rendering process. During rendering, search engines execute the JavaScript code on a webpage, simulating how a browser would display it to a user.
This gives the crawler a better understanding of the web page content, including any dynamically generated elements.
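To illustrate what rendering involves, the sketch below uses a headless browser via Playwright. This is an assumption for demonstration purposes only (Google uses its own Chromium-based rendering service), but the idea is the same: execute the page's JavaScript, then read the rendered HTML.

```python
# Illustrative rendering sketch with a headless browser (Playwright).
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.example.com/")  # placeholder URL
    rendered_html = page.content()         # HTML *after* JavaScript has run
    browser.close()

print(f"{len(rendered_html)} characters of rendered HTML")
```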
If, however, the crawler encounters an HTTP error, such as a 500 server error, it may crawl the page less frequently. When this occurs, crawlers typically reduce their crawl rate to avoid overloading the server.
Stage 5: Indexing
The final stage of the process is search engine indexing. Indexing involves storing the parsed and rendered information in a search engine's index.
The index stores various pieces of information about each page, such as its content, links, metadata, and several other relevant signals.
This information is then used by the search engine’s ranking algorithms to determine which pages are most relevant to a particular query.
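To make the idea of an index more concrete, here's a toy version of an inverted index, the core data structure most search engines build on. The pages and query are invented examples; real indexes store far richer signals.

```python
# Toy inverted index: each word maps to the set of pages it appears on,
# so candidate pages for a query can be found quickly.
from collections import defaultdict

index = defaultdict(set)

pages = {
    "https://www.example.com/crawling/": "how search engine crawling works",
    "https://www.example.com/indexing/": "how search engine indexing works",
}

for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

# Query time: intersect the sets of pages containing each query term.
query = "crawling works"
results = set.intersection(*(index[word] for word in query.lower().split()))
print(results)  # {'https://www.example.com/crawling/'}
```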
Crawling vs. Indexing
People often confuse crawling and indexing, but they are two distinct processes.
Here’s a side-by-side comparison to help:
| Feature | Crawling | Indexing |
| --- | --- | --- |
| Definition | Finds new or updated web pages | Stores and organises page information |
| How it works | Bots follow links to discover pages | Bots analyse content and extract key information |
| Sources | Links, sitemaps, URL submissions | Crawled pages |
| Goal | Gather page data for the index | Create a searchable database |
| Control | Robots.txt controls which pages bots can crawl | Noindex directives (robots meta tag) control which pages are indexed |
| Outcome | A list of discovered URLs | A searchable database used for results |
Crawling comes before indexing; the information gathered during the crawl feeds the indexing process.
How Search Engines Discover and Index Web Pages
Now that you know more about crawling, let’s look into how search engines discover and index web pages.
Discovery
It all starts in the discovery phase. We touched on this earlier, so we won’t go into too much detail, but during this stage, the crawlers need to seek out pages by:
- Crawling: Search engines use bots to crawl the internet via links and try to find unknown web pages.
- Sitemaps: Crawlers read a site's XML sitemap to find pages they haven't crawled yet (see the sketch after this list).
- Page Submissions: To speed up crawling, you can also manually submit individual URLs using search engine console tools like Google Search Console.
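Here's the sitemap sketch referenced above: a minimal Python example of reading an XML sitemap to list the URLs a site wants crawled. The sitemap location is a placeholder; most sites expose one at /sitemap.xml.

```python
# Minimal sketch: fetch an XML sitemap and list the URLs it declares.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://www.example.com/sitemap.xml"  # placeholder
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}  # standard sitemap namespace

with urllib.request.urlopen(SITEMAP_URL, timeout=10) as response:
    root = ET.fromstring(response.read())

# Each <url><loc>...</loc></url> entry is a page the site wants crawled.
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
print(f"Discovered {len(urls)} URLs from the sitemap")
```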
Indexing
Once a page has gone through the discovery process and has been found by crawlers, it can then be indexed. This is when the web page content is processed and stored so that search engines can retrieve it when somebody searches for relevant content.
During this process, search engines are looking at various index factors, such as:
- Content quality and relevance
- Keyword usage
- Title tags and meta descriptions
- Header tags (H1, H2, etc.)
- Internal and external links
- Image alt text
- Page load speed
- Mobile-friendliness
- Structured data
- Social signals
- Freshness of content
- Domain authority and trustworthiness
- User engagement metrics
- Robots.txt file directives
Conclusion
With so much jargon out there, it's easy to get lost in the more technical terms, but understanding how crawlers move through your website gives you a head start in understanding what Google does and doesn't like about your site, and what it can and can't reach. Improving your XML sitemap so that all of your most important pages are listed, strengthening your internal linking and information architecture so crawlers can find every page, and generally keeping in line with SEO best practices will help your site be crawled more effectively and increase your chances of being indexed.
Need some help getting your technical SEO up to scratch? Get in touch with an expert member of our team today to see how we can help.
FAQs
How often do search engine crawlers revisit web pages?
There's no fixed schedule, as crawl frequency is determined automatically and varies from site to site. However, you can encourage faster recrawling by manually submitting a URL or sitemap, or by increasing internal and external linking.
Can search engine crawlers access non-text files like images and videos?
Yes. They can access non-text files like images and videos. Although they can’t visually see the image or video, they extract information like file names, alt text, captions, and surrounding text to help them understand the file.
How do search engines discover new pages?
Search engines discover new pages using search engine crawlers. These crawlers find new pages by following internal and external links, sitemaps, or through a manual page submission.