27 Jul 2023

Creating a Simple Search Engine with Python

In this blog post, we will embark on an exciting journey to build a simple search engine using Python. Our search engine will be able to crawl the web, index websites, and provide basic search functionality. By the end of this tutorial, you'll have a foundational understanding of web crawling, indexing, and implementing a basic search algorithm.


Table of Contents

  1. Introduction
  2. Prerequisites
  3. Setting up the Environment
  4. Web Crawling
    • Understanding Crawling
    • Setting up Requests Library
    • Crawling a Website
    • Handling Robots.txt
    • Crawling Multiple Pages
  5. Indexing
    • Understanding Indexing
    • Parsing the Content
    • Creating an Inverted Index
  6. Implementing Search
    • Simple Keyword Search
    • Displaying Search Results
  7. Conclusion
  8. Additional Resources

Introduction

Search engines play a crucial role in our everyday lives by helping us find information on the internet. Behind the scenes, search engines rely on complex algorithms, but we'll start with a simple implementation. Our search engine will have three main components: web crawling, indexing, and search functionality.

Prerequisites

Before we begin, you should have a basic understanding of Python programming, HTML, and web protocols (HTTP). We'll be using Python 3.x for this project.

Setting up the Environment

Let's start by setting up the environment. Create a new directory for our project and install the necessary libraries:

mkdir simple_search_engine
cd simple_search_engine
pip install requests beautifulsoup4

We'll use the requests library for making HTTP requests to crawl web pages and beautifulsoup4 to parse HTML content.

Web Crawling

Understanding Crawling

Web crawling is the process of systematically browsing the internet to gather information from web pages. We'll create a simple web crawler that follows the URLs of web pages, retrieves their content, and stores it for further processing.

Setting up Requests Library

In our Python script, let's start by importing the required libraries:

import requests
from bs4 import BeautifulSoup

Crawling a Website

To crawl a single web page, we need to send an HTTP GET request and retrieve its content. We'll use the requests library to do this:

def crawl_page(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.content
    except requests.exceptions.RequestException as e:
        print("Error crawling the page:", e)
    return None
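
As a quick check, you can call this function with a page you're allowed to crawl (the URL below is just a placeholder):

html = crawl_page("https://example.com")
if html:
    print("Fetched {} bytes".format(len(html)))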

Handling Robots.txt

Before crawling a website, it's essential to check the website's "robots.txt" file. This file tells web crawlers which pages can be crawled and which should be avoided. Let's create a function to check the "robots.txt" file:

from urllib.parse import urlparse

def check_robots_txt(url):
    # Build the robots.txt URL from the site root rather than the full page URL
    parsed = urlparse(url)
    robots_url = "{}://{}/robots.txt".format(parsed.scheme, parsed.netloc)
    try:
        response = requests.get(robots_url)
        if response.status_code == 200:
            return response.text
    except requests.exceptions.RequestException as e:
        print("Error fetching robots.txt:", e)
    return None
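
Fetching robots.txt is only half the job; the crawler should also honor its rules before requesting a page. One way to do that, sketched here as an optional helper (is_allowed is not used by the code above), is Python's built-in urllib.robotparser:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="*"):
    # Point the parser at the site's robots.txt and ask whether our
    # user agent may fetch this particular URL.
    parsed = urlparse(url)
    robots_url = "{}://{}/robots.txt".format(parsed.scheme, parsed.netloc)
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()
    return parser.can_fetch(user_agent, url)

You could call is_allowed(url) inside crawl_domain before fetching each page.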

Crawling Multiple Pages

We'll create a function that crawls multiple pages within a domain using a breadth-first search approach:

def crawl_domain(seed_url, max_pages=10):
    crawled_pages = []
    queue = [seed_url]

    while queue and len(crawled_pages) < max_pages:
        url = queue.pop(0)

        if url not in crawled_pages:
            content = crawl_page(url)
            if content:
                crawled_pages.append(url)
                soup = BeautifulSoup(content, 'html.parser')

                # Process the content, extract links, and add them to the queue
                # ...

    return crawled_pages

In practice, you'll need to parse the content and extract links from the HTML to add them to the queue.
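
A minimal sketch of that step is shown below; extract_links is a helper name introduced here for illustration. It pulls every anchor's href out of the parsed page, resolves relative links with urljoin, and keeps only URLs on the same domain:

from urllib.parse import urljoin, urlparse

def extract_links(base_url, soup):
    # Collect absolute, same-domain links from the parsed page
    base_domain = urlparse(base_url).netloc
    links = []
    for anchor in soup.find_all('a', href=True):
        absolute = urljoin(base_url, anchor['href'])
        if urlparse(absolute).netloc == base_domain:
            links.append(absolute.split('#')[0])  # drop fragment identifiers
    return links

Inside crawl_domain, the placeholder comment could then become queue.extend(extract_links(url, soup)).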

Indexing

Understanding Indexing

Indexing is the process of building an index of words present in the crawled pages. We'll create an inverted index, where each word is associated with a list of documents where it appears.
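
For example, after indexing two hypothetical pages the structure might look like this (the URLs are made up for illustration):

inverted_index = {
    "python":  ["https://example.com/a", "https://example.com/b"],
    "crawler": ["https://example.com/a"],
    "search":  ["https://example.com/b"],
}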

Parsing the Content

Before building the index, we need to parse the crawled content to extract the words. We can use the nltk library for text processing. Install the library using:

pip install nltk

Now, let's create a function to tokenize the content:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

def tokenize_content(content):
    tokens = word_tokenize(content)
    words = [word.lower() for word in tokens if word.isalpha()]
    stop_words = set(stopwords.words('english'))
    return [word for word in words if word not in stop_words]
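
For example, running the tokenizer on a short sentence strips punctuation, lowercases the words, and removes stop words such as "the" and "of":

print(tokenize_content("The quick index of the search engine"))
# Expected output: ['quick', 'index', 'search', 'engine']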

Creating an Inverted Index

With the tokenized content, we can build an inverted index:

def build_inverted_index(crawled_pages):
    inverted_index = {}
    for url in crawled_pages:
        content = crawl_page(url)
        if content:
            # Strip the HTML markup so we only index the visible text
            soup = BeautifulSoup(content, 'html.parser')
            text = soup.get_text(separator=' ')
            tokens = tokenize_content(text)
            for token in tokens:
                if token not in inverted_index:
                    inverted_index[token] = [url]
                elif url not in inverted_index[token]:
                    inverted_index[token].append(url)
    return inverted_index

Implementing Search

Simple Keyword Search

Now that we have the inverted index, we can implement a simple keyword search that returns every page containing at least one of the query terms:

def search(query, inverted_index):
    query_terms = tokenize_content(query)
    if not query_terms:
        return []

    result_urls = []
    for term in query_terms:
        if term in inverted_index:
            result_urls.extend(inverted_index[term])

    return list(set(result_urls))  # Return unique URLs

Displaying Search Results

Finally, let's create a function to display the search results:

def display_search_results(query, results):
    print("Search results for '{}':".format(query))
    if not results:
        print("No results found.")
    else:
        for i, url in enumerate(results, start=1):
            print("{}. {}".format(i, url))
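
To tie everything together, a minimal end-to-end run might look like this (the seed URL is a placeholder; substitute a site you're permitted to crawl):

if __name__ == "__main__":
    seed = "https://example.com"  # placeholder seed URL
    pages = crawl_domain(seed, max_pages=5)
    index = build_inverted_index(pages)

    query = "python search"
    results = search(query, index)
    display_search_results(query, results)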

Conclusion

Congratulations! You've successfully created a simple search engine with Python. Although it's basic, this project provides a foundation for building more sophisticated search engines with additional features like ranking algorithms, user interfaces, and crawling strategies.

Remember that building a production-ready search engine requires handling various challenges, such as scaling, distributed crawling, and efficient indexing algorithms. Nonetheless, this simple implementation serves as a fantastic starting point for learning and experimenting with search engine development.

Additional Resources

  1. Python requests library documentation
  2. Beautiful Soup documentation
  3. NLTK documentation
  4. Web scraping and crawling with Python
  5. Introduction to Information Retrieval

Happy coding, and enjoy exploring the vast world of search engines!