Manish Atri logo
Manish Atri
databases

Understanding Database Index Internals: A Deep Dive with PostgreSQL

Understanding Database Index Internals: A Deep Dive with PostgreSQL
0 views
5 min read
#databases

Database indexes are crucial for query performance, yet they remain one of the most misunderstood concepts in database management. In this deep dive, we'll explore how indexes work internally, examine different index types, and understand their performance implications using PostgreSQL as our example.

What Are Database Indexes?

Think of a database index like the index in a book. Instead of flipping through every page to find a topic, you can quickly jump to the right page using the index. Similarly, database indexes provide a fast path to locate data without scanning every row in a table.

The Cost of Not Having Indexes

Let's look at a practical example. Consider a table of users:

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255),
    username VARCHAR(50),
    created_at TIMESTAMP
);

Without an index, finding a user by email requires a sequential scan:

EXPLAIN ANALYZE SELECT * FROM users WHERE email = '[email protected]';

-> Seq Scan on users (cost=0.00..483.00 rows=1 width=244)
   Filter: (email = '[email protected]'::text)

Types of Indexes in PostgreSQL

1. B-tree Index (Default)

B-tree (Balanced tree) is PostgreSQL's default index type, and for good reason. It's efficient for equality and range queries.

CREATE INDEX idx_users_email ON users(email);

B-tree indexes work by maintaining a balanced tree structure where:

  • Leaf nodes contain the actual data/pointers
  • Non-leaf nodes contain guide posts for navigation
  • All paths from root to leaf have the same length
B-tree Index Structure

When to Use B-tree

  • Equality comparisons (=, <>)
  • Range queries (<, <=, >, >=)
  • BETWEEN operations
  • IS NULL checks
  • Pattern matching (LIKE 'foo%')

2. Hash Index

Hash indexes are optimized for equality comparisons only.

CREATE INDEX idx_users_email_hash ON users USING HASH (email);

Internal Structure

  • Creates a hash table where keys are hashed values of indexed columns
  • Fixed-size bucket arrays
  • Good for exact matches, unsuitable for range queries
Hash Index Structure

3. GiST (Generalized Search Tree)

GiST indexes are ideal for geometric data and full-text search.

CREATE TABLE locations (
    id SERIAL PRIMARY KEY,
    position POINT,
    area POLYGONb
);

CREATE INDEX idx_locations_position ON locations USING GIST (position);

Use Cases

  • Geometric operations
  • Nearest neighbor searches
  • Full-text search
  • Custom data types

4. GIN (Generalized Inverted Index)

GIN is perfect for indexing array values and full-text search.

CREATE TABLE articles (
    id SERIAL PRIMARY KEY,
    title TEXT,
    tags TEXT[]
);

CREATE INDEX idx_articles_tags ON articles USING GIN (tags);

Internal Structure

  • Stores pairs of key and posting list
  • Each key can appear multiple times
  • Excellent for arrays and full-text search
  • Slower to build but faster to search

Performance Implications

Index Size and Maintenance

Indexes come with storage and maintenance costs:

SELECT pg_size_pretty(pg_total_relation_size('users')) as table_size,
       pg_size_pretty(pg_indexes_size('users')) as indexes_size;

Write Performance

  • INSERT operations need to update all indexes
  • More indexes = slower writes
  • Updates affecting indexed columns require index updates
Index Performance Graph

When Indexes Might Not Be Used

Indexes aren't always used, even when they exist:

  1. Small tables where sequential scans are faster
  2. Large result sets (>15% of total rows)
  3. Stale statistics
  4. Complex expressions

Index Maintenance

Regular Maintenance Tasks

-- Rebuild index to remove bloat
REINDEX INDEX idx_users_email;

-- Update table statistics
ANALYZE users;

Monitoring Index Usage

SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch 
FROM pg_stat_user_indexes;

Best Practices

  1. Don't Over-Index

    • Each index adds overhead to writes
    • Only index columns used in WHERE, JOIN, or ORDER BY
  2. Compound Indexes

    CREATE INDEX idx_users_email_username ON users(email, username);
    • Order matters in compound indexes
    • Most selective column should be first
  3. Partial Indexes

    CREATE INDEX idx_active_users ON users(email) WHERE active = true;
    • Index only rows matching a condition
    • Smaller, more efficient indexes
  4. Regular Maintenance

    • Monitor index usage
    • Rebuild bloated indexes
    • Keep statistics up to date

When Not to Use Indexes

Indexes aren't always beneficial:

  • Small tables
  • Tables with frequent bulk updates
  • Columns with low selectivity
  • Columns rarely used in queries

Conclusion

Database indexes are powerful tools for query optimization, but they require careful consideration. Understanding their internal workings helps in making informed decisions about:

  • When to create indexes
  • Which type of index to use
  • How to maintain them effectively

Remember: the best indexing strategy balances query performance with maintenance overhead and storage costs. Regular monitoring and adjustment ensure your indexes continue to serve their purpose effectively.

Further Reading and Resources

You can analyze your own database's index usage and performance using the queries shared in this post. Remember to test in a development environment first, as index creation can lock tables and impact performance.