The Stages of Search Engine Crawling
Want to understand how search engines crawl your website so you know how to optimise it? Then you’re in the right place.
Lawrence Hitches
November 12, 2024

Crawling is the very first stage of a search engine discovering your website content. Without crawling, your content can't be indexed or ranked, and, most significantly, it won't get any traffic.

The process itself is remarkably complicated and completely automated. Back in 2016, Google announced that it knew of around 130 trillion pages on the web.

However, research from Ahrefs also shows that 96.55% of content gets zero traffic from Google. This suggests that search engines, like Google, only end up crawling, indexing, and ranking a fraction of the content out there.

In this article, we’ll break down the stages of search engine crawling so you have a better understanding of how search engines crawl websites.

What is Search Engine Crawling?

Google defines web crawling as the process of discovering and downloading text, images, and videos from pages found online, using automated bots called search engine web crawlers. Crawlers are often referred to as spiders.

These crawlers find new pages by travelling through URLs (links). They crawl sitemaps, internal links, and backlinks to find additional pages that haven’t been crawled. Once they find a new page, they extract the information to index it in their database.

Different Search Engine Bots & User Agent Strings

Knowing the different search engine bots and their user-agent strings will help you better understand search engine crawling.

Bots are crawlers. These are automated programs whose sole job is to discover new web pages online. User-agent strings are unique identifiers that these bots use to announce themselves when they request access to a website’s server.

This matters because if you block a particular bot’s access, your pages won’t be crawled by that search engine, and therefore won’t be ranked on it.

Search Engine | Bot Name | User Agent String Example | Purpose
Google | Googlebot | Googlebot/2.1 (+http://www.google.com/bot.html) | Crawls and indexes web pages for Google Search and other Google services.
Bing | Bingbot | Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) | Crawls and indexes web pages for Bing Search.
DuckDuckGo | DuckDuckBot | DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html) | Crawls and indexes web pages for DuckDuckGo Search.
Yandex | YandexBot | Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) | Crawls and indexes web pages for Yandex Search.
Baidu | Baiduspider | Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) | Crawls and indexes web pages for Baidu Search.
Apple | Applebot | Applebot/1.0 (+http://www.apple.com/go/applebot) | Crawls and indexes web pages for Apple services like Spotlight Search.
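
Because a bot you block in robots.txt simply won’t crawl your pages for that search engine, it’s worth checking what your rules actually allow. Here’s a minimal Python sketch using the standard library’s urllib.robotparser; the domain and paths are placeholders, not a recommendation for your site.

```python
# A minimal sketch: check whether a given crawler user agent is allowed
# to fetch a URL under a site's robots.txt rules.
# "www.example.com" and "/blog/" are placeholders.
from urllib import robotparser

robots = robotparser.RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")
robots.read()  # downloads and parses the robots.txt file

for user_agent in ["Googlebot", "Bingbot", "DuckDuckBot"]:
    allowed = robots.can_fetch(user_agent, "https://www.example.com/blog/")
    print(f"{user_agent} allowed to crawl /blog/: {allowed}")
```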

The Stages of Search Engine Crawling

Before serving search results, a search engine must work through a complex crawling process. That process can be broken down into five stages:

Stage 1: Discovery

The crawling process starts with URL discovery. Search engines, like Google, use bots (in this case, Googlebot) to find new pages they haven’t crawled.

Typically, search engines have a huge queue of URLs waiting to be crawled. This is why optimising crawlability is important: it helps ensure your pages are picked up from that queue sooner rather than later.

There are several ways search engines discover new content (a simplified sketch of the discovery queue follows the list below). This is either through:

  • Crawling previously crawled URLs for updates or changes.
  • Crawling a website’s XML sitemap to find new pages.
  • Crawling internal and external links to find new pages.
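
To make the idea of a crawl queue more concrete, here’s a heavily simplified Python sketch of a discovery queue fed from the three sources above. The URLs are placeholders, and real search engine infrastructure is vastly more sophisticated than this.

```python
# A heavily simplified illustration of a discovery queue. Real search engines
# operate at an enormously larger scale; the URLs below are placeholders.
from collections import deque

seen = set()           # every URL the "search engine" already knows about
crawl_queue = deque()  # URLs waiting to be fetched

def discover(url: str) -> None:
    """Queue a URL for crawling if it hasn't been discovered before."""
    if url not in seen:
        seen.add(url)
        crawl_queue.append(url)

# 1. Re-queue previously crawled URLs to check for updates or changes.
discover("https://www.example.com/")

# 2. URLs found in the site's XML sitemap.
for url in ["https://www.example.com/blog/", "https://www.example.com/about/"]:
    discover(url)

# 3. Internal and external links found while parsing already-crawled pages.
discover("https://www.example.com/blog/search-engine-crawling/")

print(list(crawl_queue))  # the order in which pages would be fetched
```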

Remember, however, “Crawling is not a guarantee you’re indexed.” – Rand Fishkin

Just because your web page has been discovered doesn’t mean it’ll be indexed. Search engines have a whole set of requirements a page must meet before they’ll index it.

Stage 2: Fetching

Once a URL is selected from the queue, the search engine crawler sends an HTTP request to the web server hosting that page.

The server then responds by sending the page’s content, in most cases as HTML. This HTML code contains the structure and text of the page, along with references to other resources like images, CSS stylesheets, and JavaScript files.
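
Conceptually, the fetch step looks a lot like any other HTTP request. The sketch below uses the third-party requests library with a made-up crawler user agent; the URL is a placeholder.

```python
# A minimal sketch of the fetch step: request a URL and receive its HTML.
# The URL and the user agent string are placeholders, not real crawler values.
import requests

url = "https://www.example.com/blog/"
response = requests.get(
    url,
    headers={"User-Agent": "MyExampleCrawler/1.0 (+https://www.example.com/bot)"},
    timeout=10,
)

print(response.status_code)                   # e.g. 200 if the server responded successfully
print(response.headers.get("Content-Type"))   # usually text/html for web pages
html = response.text                          # the HTML the parsing stage will analyse
```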

Stage 3: Parsing

After fetching the page, a crawler parses (aka analyses) the HTML content to extract information.

The type of information it extracts consists of the following (a simplified parsing example follows the list):

  • Links: The crawler identifies all the links within the HTML code, both internal and external. These are then added to the discovery queue for future crawling.
  • Resources: The crawler also extracts references to other resources embedded in the HTML, such as images, CSS stylesheets, and JavaScript files. These are then fetched and analysed separately.
  • Metadata: The crawler also pulls out metadata, such as the page’s title, meta description, and keywords, which are used to understand the page’s content and relevance.
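
Here’s a minimal sketch of what that extraction can look like, using the third-party BeautifulSoup library on a placeholder HTML snippet. Real crawlers use their own parsers, but the idea is the same.

```python
# A minimal sketch of the parsing step: pull links, resources and metadata
# out of fetched HTML. The HTML string is a small placeholder.
from bs4 import BeautifulSoup

html = """<html><head><title>Example page</title>
<meta name="description" content="A short example."></head>
<body><a href="/about/">About</a><img src="/logo.png" alt="Logo"></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Links: candidates for the discovery queue.
links = [a["href"] for a in soup.find_all("a", href=True)]

# Resources: images (and, similarly, stylesheets and scripts) referenced by the page.
images = [img["src"] for img in soup.find_all("img", src=True)]

# Metadata: title and meta description.
title = soup.title.string if soup.title else None
description_tag = soup.find("meta", attrs={"name": "description"})
description = description_tag["content"] if description_tag else None

print(links, images, title, description)
```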

Stage 4: Rendering

Nowadays, many pages are built with JavaScript, which means the raw HTML alone doesn’t always tell a crawler what a page’s content actually looks like.

To fully understand the content of a webpage, search engines put it through a rendering process. During rendering, the search engine executes the page’s JavaScript, simulating how a browser would display it to a user.

This gives the crawler a better understanding of the web page content, including any dynamically generated elements.
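
You can approximate this rendering step yourself with a headless browser. The sketch below uses the third-party Playwright library on a placeholder URL; search engines run their own rendering infrastructure, but the principle of executing JavaScript before reading the content is the same.

```python
# A minimal sketch of rendering: load a page in a headless browser, let the
# JavaScript run, then read the resulting HTML. The URL is a placeholder.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.example.com/")
    rendered_html = page.content()  # HTML after JavaScript has executed
    browser.close()

print(len(rendered_html))
```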

If, however, the crawler runs into an HTTP error, such as a 500 server error, it may crawl the page less frequently. When this occurs, search engines typically reduce their crawl rate to avoid overloading the server.

Stage 5: Indexing

The final stage of the process is search engine indexing. Indexing involves storing the parsed and rendered information in a search engine’s index.

The index stores various pieces of information about each page, such as its content, links, metadata, and several other relevant signals.

This information is then used by the search engine’s ranking algorithms to determine which pages are most relevant to a particular query.
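
At its simplest, an index is a lookup from content signals to pages, a bit like an inverted index mapping words to the URLs that contain them. The toy Python sketch below illustrates the idea with placeholder pages; real search engine indexes store far richer signals than plain words.

```python
# A toy illustration of an inverted index: mapping words to the pages that
# contain them. The pages and text are placeholders.
from collections import defaultdict

pages = {
    "https://www.example.com/crawling/": "how search engines crawl websites",
    "https://www.example.com/indexing/": "how search engines index web pages",
}

inverted_index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        inverted_index[word].add(url)

# At query time, the engine looks up which pages contain the query terms.
print(inverted_index["crawl"])   # pages containing "crawl"
print(inverted_index["index"])   # pages containing "index"
```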

Crawling vs. Indexing

People often confuse crawling and indexing, but they’re completely different.

Here’s a side-by-side comparison to help: 

Feature | Crawling | Indexing
Definition | Finds new or updated web pages | Stores and organises page information
How it works | Bots follow links to fetch pages | Bots analyse content and extract key information
Sources | Links, sitemaps, URL submissions | Crawled pages
Goal | Gather page data for the index | Create a searchable database
Control | Robots.txt tells bots what they may crawl | Meta robots noindex tags control what gets indexed
Outcome | A list of URLs to fetch and parse | A searchable database used to serve results

Crawling always comes first: it gathers the pages that the indexing process then stores and organises.
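
To make the “Control” row concrete, here’s a minimal Python sketch (using the third-party BeautifulSoup library) that checks a placeholder page for a meta robots noindex directive. Robots.txt decides whether a bot fetches the page at all, while a noindex tag on a fetched page asks the engine not to store it in the index.

```python
# A minimal sketch: detect a meta robots "noindex" directive in fetched HTML.
# The HTML snippet is a placeholder.
from bs4 import BeautifulSoup

html = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
soup = BeautifulSoup(html, "html.parser")

robots_meta = soup.find("meta", attrs={"name": "robots"})
content = robots_meta["content"].lower() if robots_meta else ""

if "noindex" in content:
    print("Page can be crawled, but it asks not to be indexed.")
else:
    print("No noindex directive found; page is eligible for indexing.")
```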

How Search Engines Discover and Index Web Pages

Now that you know more about crawling, let’s look into how search engines discover and index web pages.

Discovery

It all starts in the discovery phase. We touched on this earlier, so we won’t go into too much detail, but during this stage, the crawlers need to seek out pages by:

  • Crawling: Search engines use bots to crawl the internet via links and try to find unknown web pages.
  • Sitemaps: Crawlers explore XML sitemaps to find pages on a website they haven’t crawled yet (a short sitemap-reading sketch follows this list).
  • Page Submissions: To speed up crawling, you can also manually submit individual URLs using search engine console tools like Google Search Console.
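
For the sitemap route, a crawler essentially reads the &lt;loc&gt; entries out of your XML sitemap and adds them to its discovery queue. Here’s a minimal Python sketch using the standard library; the sitemap content is an inline placeholder.

```python
# A minimal sketch of reading URLs out of an XML sitemap.
# The sitemap content below is a placeholder rather than a fetched file.
import xml.etree.ElementTree as ET

sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/blog/</loc></url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]

print(urls)  # URLs a crawler could add to its discovery queue
```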

Indexing

Once a page has gone through the discovery process and has been found by crawlers, it can then be indexed. This is when the web page content is processed and stored so that search engines can retrieve it when somebody searches for relevant content.

During this process, search engines look at various indexing factors, such as:

  • Content quality and relevance
  • Keyword usage
  • Title tags and meta descriptions
  • Header tags (H1, H2, etc.)
  • Internal and external links
  • Image alt text
  • Page load speed
  • Mobile-friendliness
  • Structured data
  • Social signals
  • Freshness of content
  • Domain authority and trustworthiness
  • User engagement metrics
  • Robots.txt and meta robots directives

Many factors determine how well your web pages get indexed, but none of them matter unless you first allow Google to properly crawl the most important pages of your website.


Conclusion

With so much jargon out there, it’s easy to get lost in the more technical terms, but understanding how crawlers move through your website gives you a head start on seeing what Google can and can’t reach, and what it does and doesn’t like about your site. Keep your XML sitemap up to date so all of your most important pages are listed, strengthen your internal linking and information architecture so crawlers can find every page, and follow general SEO best practices; together these will help your site get crawled more effectively and increase your chances of being indexed.

Need some help getting your technical SEO up to scratch? Get in touch with an expert member of our team today to see how we can help.

FAQs

How often do search engine crawlers revisit web pages?

There’s no fixed schedule; crawl frequency is decided automatically and varies from page to page and site to site. However, you can encourage more frequent crawling by manually submitting a URL or sitemap, or by improving your internal and external linking.

Can search engine crawlers access non-text files like images and videos?

Yes. They can access non-text files like images and videos. Although they can’t visually see the image or video, they extract information like file names, alt text, captions, and surrounding text to help them understand the file.

How do search engines discover new pages?

Search engines discover new pages using search engine crawlers. These crawlers find new pages by following internal and external links, sitemaps, or through a manual page submission.

