Advanced Postgres Schema Design

(For Scalable Application Development)

PGConf NYC 2021

December 3, 2021
New York, NY

About Me

  • Sehrope Sarkuni
  • Founder of JackDB, Inc. (jackdb.com)
  • Maintainer of pgjdbc (Java driver)
  • Contributor to node-postgres (Node.js driver)

Schemas ↔ Code

Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious.
Fred Brooks, The Mythical Man Month (1975)
Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
Linus Torvalds, LKML (2006)

Schemas ↔ Code

Dumb schemas require smart code.
Smart schemas stop dumb code.
Sehrope Sarkuni, PGConf NYC (2021)

One DB ↔ Many Apps

  • Concurrent access
  • Some read / Some write
  • Schema = API

What's a schema?

  • What you can do
  • What you can't do

Schema — Can Do's

  • Read (SELECT)
  • Write (INSERT / UPDATE / DELETE)
  • Execute (Functions & Procedures)

Performance Matters

Too slow ≈ Impossible

Schema — Can't Do's

  • Prevent invalid data in fields
  • Prevent invalid relationships
  • Prevent incomplete data

Schema Priority Order

  1. Consistency
  2. Efficiency
  3. Simplicity

Can't Do > Can Do

Consistency > Efficiency

Efficiency > Simplicity

Bottom Up

  1. Columns
  2. Relations (Tables & Views)
  3. Relationships (Foreign Keys)
  4. Permissions (Grants)

Bottom Up

  1. Data Types
  2. Columns
  3. Relations (Tables & Views)
  4. Relationships (Foreign Keys)
  5. Permissions (Grants)

Data Types

  • boolean
  • int / bigint
  • date
  • timestamp / timestamptz
  • text
  • uuid

More Data Types

  • interval
  • ranges
  • numeric
  • arrays
  • inet / cidr
  • Many more ...

Size Matters

Smaller ≈ Faster

UUID


						/* UUID stored as text, 
						   e.g. "7c9a0778-566b-49a7-8743-205009b836aa" */
						CREATE TABLE bad_idea (
						  id_as_text   text
						);

						/* UUID stored as native type */
						CREATE TABLE good_idea (
						  id_as_uuid   uuid
						);
					

UUID


						INSERT INTO bad_idea
						  SELECT md5('' || x)::uuid::text
						  FROM generate_series(1, 1000000) x;
						
						INSERT INTO good_idea
						  SELECT md5('' || x)::uuid
						  FROM generate_series(1, 1000000) x;
						
						table_name | table_size
						-----------+-----------
						bad_idea   | 65 MB
						good_idea  | 42 MB
					

Smaller and faster

UUID


						INSERT INTO good_idea (id_as_uuid)
						VALUES ('Not_A_UUID');

						-- ERROR:  invalid input syntax for type uuid: "Not_A_UUID"
					

Smaller, faster, and correct

Data Alignment


							CREATE TABLE bad_alignment (
								a  boolean,
								b  bigint,
								c  boolean,
								d  bigint,
								e  boolean,
								f  bigint,
								g  boolean,
								h  bigint
							);
						
					

						CREATE TABLE good_alignment (
							b  bigint,
							d  bigint,
							f  bigint,
							h  bigint,
							a  boolean,
							c  boolean,
							e  boolean,
							g  boolean
						);
					

Column definition order matters

Data Alignment


						INSERT INTO bad_alignment
						  SELECT true, 1, true, 2, true, 3, true, 4
						  FROM generate_series(1, 1000000) x;
						
						INSERT INTO good_alignment
						SELECT 1, 2, 3, 4, true, true, true, true
						  FROM generate_series(1, 1000000) x;
						
						table_name      | table_size 
						----------------+------------
						bad_alignment   | 89 MB
						good_alignment  | 65 MB
					

Smaller rows ≈ More rows per page ≈ Faster

Data Alignment


					    [12345678 12345678 12345678 12345678 12345678 12345678]
					    [B        DDDDDDDD B        DDDDDDDD B        DDDDDDDD]
					    [DDDDDDDD DDDDDDDD DDDDDDDD BBB-----]

					"B" = Boolean        "DDDDDDDD" = Double        "-" = Unused
					

(Go biggest to smallest)

Text is for Text

And it's great at it

Avoid in text fields:

  • UUIDs — Use uuid
  • Numbers — Use int / bigint / numeric
  • Dates — Use date
  • Timestamps — Use timestamptz
  • JSON — Use json / jsonb
  • Binary Data — Use bytea

CHECK Constraints


						CREATE TABLE person (
							id        uuid PRIMARY KEY,
							username  text NOT NULL,
							name      text NOT NULL,
							CONSTRAINT person_ck_username_pattern
							  CHECK (username ~* '^[a-z][a-z0-9_]+[a-z0-9]$'),
							CONSTRAINT person_ck_name_length
							  CHECK (LENGTH(name) <= 255)
						);
					
Check yourself before you wreck yourself
Ice Cube, Check Yo' Self (1992)

CHECK Constraints


						INSERT INTO person (id, username, name) VALUES (
						  'b4e4b397-5104-f16c-0d4a-b95bab4a8eae'
						  '#bad#',
						  'Alice');

						-- ERROR:  new row for relation "person" violates
						--        check constraint "person_ck_username_pattern"
						
						INSERT INTO person (id, username, name)
						SELECT 'b4e4b397-5104-f16c-0d4a-b95bab4a8eae' AS id,
						       'good' AS username,
						       repeat('too_long_', 100) AS name;
						
						-- ERROR:  new row for relation "person" violates
						--        check constraint "person_ck_name_length"
					

Explicit naming of constraints helps too!

Multi-Column CHECKS


						CREATE TABLE widget (
							id                  uuid PRIMARY KEY,
							manager_person_id   bigint REFERENCES person(id),
							manager_group_id    bigint REFERENCES person(id)
							CONSTRAINT widget_ck_manager
							  CHECK (   (manager_person_id IS NULL)
							         != (manager_group_id IS NULL))
						);
					
Enforce that exactly one type of manager field is populated.

Arrays


						SELECT '{foo,bar,baz,bam}'::text[]
					
  • Usually a bad idea
  • Lots of gotchas
  • Sometimes a great idea

Array Gotchas


						SELECT '{{foo,bar},{baz,bam}}'::text[]
					
  • Dimensions
  • Duplicates
  • Ordering

Array Checks


						CREATE OR REPLACE FUNCTION array_sort(anyarray)
						  RETURNS anyarray AS
						  $BODY$
						    SELECT array_agg(elem ORDER BY elem)
						    FROM UNNEST($1) AS elem;
						  $BODY$
						  LANGUAGE SQL IMMUTABLE;
						
						CREATE TABLE foo (
						  some_array        text[] NOT NULL,
						  CONSTRAINT foo_ck_some_array_dim
						    CHECK (array_ndims(some_array) = 1),
						  CONSTRAINT foo_ck_some_array_order
						    CHECK (some_array = array_sort(some_array))
						);
					
Enforce dimensions and sort order to guarantee it for all reads.

Avoid: Flex Fields

  • Free form / no structure
  • #1 cause is fear of migrations
  • #2 cause is fear of bureaucracy

Three Kinds of Databases

  • Evolving schemas
  • Read-only
  • Dropped

Schema Migrations

  • Lots of tools available
  • Simple as .sql files
  • Transactional DDL is awesome

Forward Progress

  • Schemas only evolve forward
  • Reverting is a new step forward
  • DB migrations are part of app code

Testing Locally


						/* Create a backup */
						CREATE DATABASE app_bk
						  TEMPLATE app
						  OWNER app_owner;

						/* Run your DB migrations here ... */

						/* No good? */
						DROP DATABASE app;

						CREATE DATABASE app
						  TEMPLATE app_bk
						  OWNER app_owner;
					

Minimal Downtime

  • Change less
  • Change often
  • Change carefully

Locks = Downtime

Lock Duration

  • "Zero"
  • Short
  • Long

"Zero"

  • Creating new tables
  • Creating new types
  • Creating indexes on new tables
  • Adding data to new tables

Short Locks

  • CREATE INDEX ... CONCURRENTLY
  • ... ADD COLUMN ...
  • ... DROP COLUMN ...
  • ... ADD CONSTRAINT ... NOT VALID
  • ALTER TYPE ... ADD VALUE ...

Long Locks

  • ALTER VIEW ...
  • CREATE INDEX ...
  • ... ADD CONSTRAINT ...

Data Access Patterns

  • Known and constant — index lookup
  • Known and small — index scan
  • Known and big — index scan(s)
  • Unknown

Indexes

  • Redundant data → fast access
  • Everything has a cost
  • Bloat / Write amplification

Indexes

PostgreSQL guarantees consistency

(And consistency is always priority #1)

B-tree Indexes

  • Default type (CREATE INDEX ...)
  • Supports equality (WHERE a = 1)
  • Supports ranges (WHERE a > 1)

Unique Indexes


						CREATE UNIQUE INDEX foo_uk_bar_id
						  ON foo (bar_id);

						/* This succeeds */
						INSERT INTO foo (bar_id) VALUES (1);

						/* This fails */
						INSERT INTO foo (bar_id) VALUES (1);
						-- error: duplicate key value violates
						--        unique constraint "foo_uk_bar_id"
					

PostgreSQL guarantees we cannot insert duplicates

(And consistency is always priority #1)

Partial Indexes


						CREATE INDEX foo_ix_active_bar_id
						  ON foo (bar_id)
						  WHERE active;
						
						SELECT t.*
						FROM foo t
						WHERE t.active
						  AND t.bar_id = 123
					

Only data matching the WHERE clause is indexed

(Smaller ≈ Faster)

Including Indexes


						CREATE INDEX foo_ix_a_including_b_c
						  ON foo (a)
						  INCLUDE (b, c);
						
						/* Index-only scan is possible */
						SELECT
						  t.a,
						  t.b,
						  t.c
						FROM foo t
						WHERE t.a = 123
					

Read non-key fields without going back to table

(Efficiency > Simplicity)

Expression Indexes

  • a.k.a. "Functional Indexes"
  • Must be immutable
  • Incredibly useful

Expression Indexes


						CREATE INDEX foo_ix_lower_name
						  ON foo (lower(name));
						
						/* This will use our index */
						SELECT t.*
						FROM foo t
						WHERE lower(name) = 'alice'
					

Combine Them All


						/* UNIQUE ensures we cannot insert duplicates */
						CREATE UNIQUE INDEX foo_uk_lower_name_active
						  /* Indexed value is the lower case transformation */
						  ON foo (lower(name))
						  /* Only include rows which are active */
						  WHERE active;
					

PostgreSQL guarantees we cannot insert active duplicates with the same lower case name

(And consistency is always priority #1)

Views

  • Hide complexity
  • Backward compatibility
  • Ad-hoc convenience

Views


						CREATE VIEW active_account_detail AS
						SELECT
						  a.id,
						  a.manager_id,
						  p.name AS manager_name,
						  a.department_id,
						  d.name AS department_name,
						  a.created_at
						FROM account a
						  JOIN person p ON p.id = a.manager_id
						  JOIN department d ON d.id = a.department_id
						WHERE a.is_active
					

Advanced Views


						CREATE VIEW recent_sales AS
						SELECT
						  s.id,
						  s.sale_date,
						  s.sale_price,
						  s.quantity,
						  s.product_id,
						  p.name AS product_name,
						  p.unit_price,
						  s.customer_id,
						  c.name AS customer_name
						FROM generate_series(0, 30) x
						  JOIN sale s ON s.sale_date = (now()::date - x)
						  JOIN product p ON p.id = s.product_id
						  JOIN customer c ON c.id = s.customer_id
					

Set Returning Functions

(Views on Steroids)


						CREATE FUNCTION accessible_product (IN p_person_id int)
						  RETURNS SETOF product
						  LANGUAGE SQL
						  AS
						  $BODY$
						    SELECT p.*
						    FROM product p
						    WHERE p.owner_id = p_person_id
						    OR EXISTS(
						        SELECT 1
						        FROM product_group pg
						          JOIN person_group x ON x.group_id = pg.group_id
						        WHERE pg.product_id = p.id
						          AND x.person_id = p_person_id)
						  $BODY$
					

Embed parameters and hide details

Hotel Room Problem

How to prevent overlapping bookings of the same room?

Hotel Room Problem

(Not So Smart Schema Edition)

  • Two date columns: start and end
  • Add CHECK: (start < end)
  • Add UNIQUE: (room_id, start)
  • ???
  • Serialize via SELECT FOR UPDATE
  • Check for overlaps in app code
  • Hope nothing else modifies the data

Using Range Types / Exclusion


						CREATE TABLE reservation (
							id              bigserial PRIMARY KEY,
							room_id         bigint NOT NULL REFERENCES room (id),
							booking_dates   daterange NOT NULL,
							-- Prohibit overlapping bookings for the same room:
							CONSTRAINT reservation_uk_room_id_booking_dates
							  EXCLUDE USING gist
								(room_id WITH =, booking_dates WITH &&),
							-- Ensure that the daterange are half open
							CONSTRAINT reservation_ck_booking_dates_valid
							  CHECK (        lower_inc(booking_dates)
									 AND NOT upper_inc(booking_dates))
						);
					

PostgreSQL will ensure this constraint is enforced

(And consistency is always priority #1)

Smart Schemas

  1. Build apps fast
  2. Build fast apps

(And PostgreSQL allows for some really smart schemas)

Q&A

Sehrope Sarkuni

sehrope@jackdb.com

Thank You!