Building a scalable time-series database on PostgreSQL
Michael J. Freedman is the co-founder and CTO of TimescaleDB, an open-source database that scales SQL for time-series data, and a Professor of Computer Science at Princeton University. His research focuses on distributed systems, networking, and security.
Previously, Freedman developed CoralCDN (a decentralized CDN serving millions of daily users) and Ethane (the basis for OpenFlow / software-defined networking). He co-founded Illuminics Systems (acquired by Quova, now part of Neustar) and is a technical advisor to Blockstack.
Honors include: Presidential Early Career Award for Scientists and Engineers (PECASE, given by President Obama), SIGCOMM Test of Time Award, Caspar Bowden Award for Privacy Enhancing Technologies, Sloan Fellowship, NSF CAREER Award, Office of Naval Research Young Investigator Award, DARPA Computer Science Study Group membership, and multiple award-winning publications. Prior to joining Princeton in 2007, he received his Ph.D. in computer science from NYU's Courant Institute, and his bachelor's and master's degrees from MIT.
Today everything is instrumented, generating more and more time-series data streams that need to be monitored and analyzed. When it comes to storing this data, many developers start with a well-trusted system like PostgreSQL, enjoying the convenience of having their data in one place, with time-series data stored alongside relational data and queried together using SQL. But when their data hits a certain scale, many give up Postgres' query power and ecosystem by migrating to a NoSQL or other "modern" time-series architecture. They accept the traditional trade-off between query power and scale, and in the process they silo their data.
In this talk, I describe why this perceived trade-off isn't necessary, and how we've built an efficient, scalable time-series database engineered up from PostgreSQL. In particular, the time-series workloads one finds in DevOps, monitoring, IoT, finance, and elsewhere -- which mostly insert new data about recent events -- place very different demands on a database than general transactional (OLTP) workloads. We've architected our time-series database to take advantage of and embrace these differences.
The system architecture automatically partitions data across both time and space, while exposing the illusion of a single continuous table -- a hypertable -- over all of your data, whether it lives on one server or many. Its distributed query optimizations hide the fact that users are interacting with many “chunks” of data, which are right-sized by volume and time constraints, and minimize both which chunks are accessed and how they are accessed to answer queries. In fact, the database supports all of SQL (not "SQL-like") against this hypertable (e.g., secondary indexes, rich query predicates and GROUP BYs, aggregations, window functions, CTEs, JOINs).
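As a rough illustration of the partitioning idea described above, the Python sketch below routes each row to a chunk keyed by a time bucket and a hashed space partition, then answers a query by touching only the chunks that can contain matching rows. It is a toy in-memory model under stated assumptions, not the database's actual implementation: the names `ToyHypertable`, `time_bucket`, `space_partition`, the one-day chunk interval, and the four hash partitions are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from zlib import crc32

# All names and constants here are hypothetical, chosen for illustration only.
CHUNK_INTERVAL = timedelta(days=1)   # assumed time width of one chunk
SPACE_PARTITIONS = 4                 # assumed number of hash partitions
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def time_bucket(ts):
    """Index of the fixed-width time interval that contains ts."""
    return (ts - EPOCH) // CHUNK_INTERVAL


def space_partition(device_id):
    """Stable hash of the partitioning key into one of the space partitions."""
    return crc32(device_id.encode("utf-8")) % SPACE_PARTITIONS


class ToyHypertable:
    """Looks like one table to the caller, but stores rows in many chunks."""

    def __init__(self):
        self.chunks = {}  # (time bucket, space partition) -> list of rows

    def insert(self, ts, device_id, value):
        key = (time_bucket(ts), space_partition(device_id))
        self.chunks.setdefault(key, []).append((ts, device_id, value))

    def chunks_for(self, start, end, device_id=None):
        """Keys of the chunks a query over [start, end) has to read."""
        first = time_bucket(start)
        last = time_bucket(end - timedelta(microseconds=1))  # end is exclusive
        wanted = None if device_id is None else space_partition(device_id)
        return sorted(
            key for key in self.chunks
            if first <= key[0] <= last and (wanted is None or key[1] == wanted)
        )

    def query(self, start, end, device_id=None):
        """Scan only the selected chunks, then apply the exact predicates."""
        rows = []
        for key in self.chunks_for(start, end, device_id):
            for ts, dev, value in self.chunks[key]:
                if start <= ts < end and (device_id is None or dev == device_id):
                    rows.append((ts, dev, value))
        return sorted(rows)
```

A caller only ever uses `insert` and `query`; which chunks exist, and which ones a given query skips, stays an internal detail. In this sketch, a query restricted to one day and one device reads a single chunk regardless of how much history the table holds.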
Through performance benchmarks, I show how the database scales much better than vanilla PostgreSQL, even on a single node. In particular, by appropriately sizing chunks, it avoids the "performance cliff" that vanilla PostgreSQL experiences once tables reach tens to hundreds of millions of rows, while also offering some compelling query performance improvements.
The database is implemented as a PostgreSQL extension, released under the Apache 2 license. A single-node beta release is available on GitHub, with the clustered version under development.
- 50 min
- PGConf Local: Philly 2017