As the old saying goes, there are no bad databases, only bad data. But the bunch of us swapping stories in a Lower East Side dive are reaching a different conclusion: you’d be amazed where data gets dumped by People Who Really Ought to Know Better.
People like us. We’ve seen it. And if your bleary-eyed pundits are to be believed through their tears, the first annual Very Expensive Startup Database awards go out as follows:
- Honorable mention - ColdFusion, a fine database. But not in 2016, and not in an application that eschewed modern luxuries like foreign keys and JOIN clauses. (“The first developer knew one database, and so that’s the one he used. It’s kind of impressive, really, except that now we’re stuck with it.”)
- (Tie) Honorable mention - DynamoDB, a fine database under the right circumstances; also the basis of a very expensive science-fair project.
- 3rd place - Redis (and only Redis) at scale. An excellent cache, but not without its limits (“we scaled vertically and scaled vertically and then we ran out of hardware. Which was… an interesting 48 hours.”)
- 2nd place - an HTTP log used as the event-stream at the heart of a quasi-event-sourced architecture (MariaDB was involved, too; that had its own heartache)
- 1st place - the filesystem (“yeah, you’d be shocked to learn that data corruption and disk failure turned out to be major failure modes”)
We won’t give too much away, but if you’re preparing a contender for next year’s awards, you’ll notice a couple of themes that might help your case:
- 40% of awards went to perfectly capable databases being asked to do too much
- 60% of awards involved experiments escaping the lab. “It ran on my laptop!” But prizes are only awarded in production.
- And about that first place winner? None of us are quite sure what to think.
Now, none of us are database experts. Rather than the nitty-gritty technical details of why a database would hold back a high-growth, nimble startup, all we can offer is a biased, end-user perspective on what works and what doesn’t.
A little more about that. We’ve each spent our careers in and out of various early-stage technology ventures, where the maturity curve might as well be a flat line and the uncertainty inherent in any decision (technical or otherwise) is rivaled only by the crazy ambitions of the people making it. Statistically speaking, most startups never make it past this stage. By extension, most technology decisions happen here, too. Ours may not be a comprehensive point of view, but if you’re there, too, the perspective may work to your advantage.
With that all out on the table, here’s an attempt to work through one of the many decisions your new startup (or side project) will need to make–and to keep you seated firmly in the audience at next year’s Expensive Database awards.
A startup’s database might be the second biggest factor in determining its trajectory. People are a runaway number one, but as far as technology choices go, a database has tremendous influence over where resources are spent. Just as Conway’s law anticipates an organization patterned after its communication structures, a startup’s ambitions will be structured, constrained, and (occasionally) enabled by the database at their heart.
Which is all to say: an ambitious startup needs a lot from its database.
Don’t overthink this. Just like good design, a good database should be unobtrusive. It should only draw attention when it breaks, and for the most part it shouldn’t. It should do its job and free up time for the rest of the business.
Before we get into the details, one very important question: “does this startup even need one [yet]?” If you’ve set up a landing page to help tease a stealth launch, just keep it on disk (it’s once you start tracking live, high-volume marketing campaigns that “the filesystem” no longer holds up—as People Who Really Ought To Know Better can confirm). If you’re running a prototype with smoke, mirrors, and elbow grease, a spreadsheet can track the details and easily adjust with your evolving ‘schema’.
If the benefits outweigh the costs, though, or you’re determined to run the startup on something that will make your investors less nervous than that growing Google Sheet, here are some dimensions to consider.
Compared with physical systems, software makes iteration cheap. The whole point is that it’s easy to change. An experiment, even.
Software is also inherently complex, which means that things break. The less there is to break, the more time that’s left to build, test, and change everything else. In other words: a database (plus the infrastructure to manage it) should be proven. Reliable. Trustworthy. It should enable experimentation (meaning learning and change), but it shouldn’t be an experiment itself.
There’s a big difference between iterating on the application domain and getting caught on the layers beneath. The former lets you solve problems in a novel way (and if you’re lucky, a way that customers will pay for). The latter is an expensive distraction.
A startup’s database can’t afford to be “just” an application backend. As you scramble to keep up with a growing business, the DB may be pressed into temporary service for customer service, business analytics, marketing, and anything else that references or interacts with the data inside.
Most mainstream databases exist at the center of a swirling ecosystem of no-code frameworks; data analysis tools; visualization dashboards; migration/management tooling; management consoles; and supporting controls.
If these sorts of off-the-shelf parts aren’t available, however, building them eats away at the startup’s all-important pool of development time. In the early days, problems outside your core domain are usually better solved with cash (or better yet, starter plans and free trials).
Money can be raised. Time is irreplaceable.
The creators of successful databases pay inordinate attention to detail so the rest of us don’t have to. But all databases face a certain set of operational constraints, and while different assumptions about the data or access patterns can drastically change a database’s performance under certain conditions, the tradeoffs will turn up in others.
Remember the CAP theorem? Consistency, Availability, and Partition Tolerance–and at an early stage, likely prioritized in that order. You can spend your way to higher availability, at least at a certain scale, and partitions are a problem for another day, but eventually-consistent databases solve a problem you won’t (yet) have.
“ACID” is another acronym to keep handy for trivia night: Atomicity, Consistency, Isolation, and Durability. If you’re looking to minimize surprises, these are good guarantees to have. Yes, they also guarantee bottlenecks, eventually, but by minimizing surprises along the way, they’ll greatly increase the odds that the project lives to see those bottlenecks firsthand.
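To make the “A” concrete, here’s a minimal sketch of atomicity using Python’s built-in sqlite3 module. The accounts table and amounts are invented for illustration: a transfer that would violate a constraint is rolled back wholesale, so the surprise never reaches your data.

```python
# A minimal sketch of atomicity with Python's built-in sqlite3.
# Table name and balances are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "  name TEXT PRIMARY KEY,"
    "  balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
except sqlite3.IntegrityError:
    pass  # the CHECK constraint fired; the whole transfer is rolled back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # both rows unchanged: {'alice': 100, 'bob': 50}
```

Either both updates land or neither does; the application never sees a half-finished transfer.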
Take advantage of the database. Learn to use the tools that help (foreign keys, say, with the occasional carefully-considered index) and avoid the ones that don’t. Keeping to a core set of well-understood features reduces complexity, documentation, and the odds of unexpected fallout in the application sitting on top.
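As a sketch of what “the tools that help” buys you, here’s a foreign key plus one carefully-considered index, again via sqlite3 (where foreign-key enforcement must be switched on explicitly). The table names are invented for illustration.

```python
# A sketch of a foreign key and a deliberate index, via sqlite3.
# Table names are invented; note sqlite3 requires foreign keys
# to be enabled per-connection.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "  id INTEGER PRIMARY KEY,"
    "  customer_id INTEGER NOT NULL REFERENCES customers(id),"
    "  total INTEGER)"
)
# Index the lookup you actually run, not every column in sight.
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

conn.execute("INSERT INTO customers VALUES (1, 'alice')")
conn.execute("INSERT INTO orders VALUES (1, 1, 40)")
try:
    conn.execute("INSERT INTO orders VALUES (2, 99, 10)")  # no customer 99
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the database caught the dangling reference

print("dangling insert rejected:", rejected)
```

The dangling reference is the application’s bug, but the database refuses to persist it, which is exactly the kind of unexpected fallout you want caught below the application.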
But the less the database allows, the better. If hooks and conditionals and all the trappings of Turing-completeness are there, so is the temptation to use them. And once application logic leaks into the database, well, now you have two problems.
The bottom line is that the business is the business, and (with the obligatory nod to Larry Ellison’s yacht collection) the database usually isn’t. The data may be. In which case the system hosting them should be relatively easy to manage, alter, access, and control. Developing and maintaining database features in house will cost more than it adds. Better to keep the business focused elsewhere.
With those principles in mind, let’s line up the contenders by breaking them down into two broad categories: relational databases and everything else.
Relational databases (like MariaDB and Postgres) store data in tables described by explicit schema and related through common keys. RDBMSs tend to be fairly conservative in their design, putting strong guarantees ahead of raw performance or scalability. The line separating them from “everything else” gets a bit blurry (MariaDB is a relational database with support for columnar storage, and Postgres is a SQL interface over everything but the kitchen sink), but their relational structure sets them apart from other, more specialized stores.
The non-relational camp breaks down into many subcategories. Some of the most common include:
- Key-value stores (DynamoDB, Redis) store data as (key, value) pairs, delivering good performance with very few bells and whistles
- Document stores (MongoDB, CouchDB) are a special case of key-value store built around the name (key) and contents (value) of a stored document. They work well for unstructured or semi-structured data, and typically offer more features than “pure” key-value stores
- Wide-column stores (Google BigTable, Cassandra) store data in columns, enabling performant queries over enormous volumes of data
- Graph databases (Neo4j, Amazon Neptune) model networked relationships as nodes and edges in a graph
- Search engines (Elasticsearch) use clever indices and rules-based heuristics to provide general queries over large volumes of (generally textual) data
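To illustrate the interface gap between the first category and a relational database, here’s a hypothetical toy key-value store (not any real database’s API) showing how little it promises:

```python
# A hypothetical toy key-value store, invented for illustration only --
# it is not any real database's API. The whole contract is put and get.
from typing import Any, Optional


class ToyKeyValueStore:
    """Illustrative sketch: values are opaque, keys are the only index."""

    def __init__(self) -> None:
        self._data: dict = {}

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        return self._data.get(key)


store = ToyKeyValueStore()
store.put("user:42", {"name": "alice", "plan": "free"})
print(store.get("user:42"))  # {'name': 'alice', 'plan': 'free'}
```

Answering “find every user on the free plan” means scanning every key yourself; a relational database answers the same question with one WHERE clause. That simplicity is the source of both the performance and the limitations.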
We can break the NoSQL camp down further into “general purpose” databases and those aimed at specific niches, then toss out most of the latter as too specialized for a new startup. The contenders that remain will usually be the general-purpose stores with broad adoption and mature tooling around them.
Well, we’ve made it. And just as your hosts have seen a few horror stories in their day, we’ve also seen things turn out right. Your needs may be different from ours, but here’s the bottom line: if you really do need a database, and you don’t have a clear case for another one, reach for the nearest RDBMS.
We’ll make it simpler: just use Postgres.
Key-value stores, columnar DBs, search engines, logs, ledgers, and in-memory esoterica all have roles to play in established applications, but introduce them only when needed to shore up the shortcomings of your more generally capable database.
Yes, Postgres has scalability limits, and it’s “simple” the way a David Foster Wallace novel is simple. But just because dedicated types for JSON and XML exist doesn’t mean you’re obligated to use them. And if you don’t, you still have a database that’s ACID-compliant, supported almost everywhere, won’t get in the way of hiring, and for the most part Just Works. The tools you need to manage and use it already exist, and you’ll spend more time building your application than filling in missing parts of the database itself.
And when you outgrow it, be sure to send us a check!
Very Expensive Startup Database Awards Foundation
New York, NY 10002