Wednesday, May 15, 2013

MongoDB San Francisco


Reprinted from an email sent to Hyperfine clients regarding Rob's attendance at the MongoDB conference in San Francisco:

I attended the MongoDB conference at the Palace Hotel in San Francisco on Friday, May 10th. The conference, hosted by 10gen (http://www.10gen.com), was very well attended, with roughly 1,000 attendees by my estimation. You can visit the conference page at http://www.10gen.com/events/mongodb-san-francisco-2013 for a list of event sponsors, but I'll mention here that I spoke with representatives from MongoLab, Microsoft Open Technologies, and StrongLoop, among others.

The main takeaways from the conference were that Mongo is growing in popularity, robustness, toolsets, and best practices. While the debate continues to swirl over traditional relational database approaches versus NoSQL in general and Mongo in particular, this feels to me more and more like a distraction from making progress on individual projects. The arguments read like debates over theoretical computer science principles, but just as RSS flourished in its time and JSON has overwhelmed XML and XML Schema, simplicity rules in the world of rapid, agile development, and Mongo is certainly a strong player in that world. The benefits I've suggested in the past regarding JavaScript-everywhere, and leveraging that ubiquity of language and object description, seemed to me to be a strong theme at the conference.

The keynote address, shared by Eliot Horowitz and Max Schireson of 10gen, covered the growth in functionality and usage of Mongo. Eliot discussed the query optimizer, which has been a part of Mongo since day one, and noted that 10gen is incrementally rearchitecting it to support upcoming features, including adaptive query evaluation and planning. The goal is to provide more insight into performance and better automatic adaptation for improved performance. 10gen has recently released a Mongo backup service, which sounds like a huge win and was noted by other, non-10gen presenters, and is working on automation tools for database management. Eliot emphasized uses of the new backup service beyond disaster recovery, such as sampling data from production deployments for use in development and data analysis. Similar approaches were discussed in breakout sessions on Mongo replication.

The day was segmented into several lecture tracks, each talk lasting around 40 minutes, with five offered per time slot. I chose a sampling from the development and operations tracks. The first was a talk by Charity Majors from Parse, an intermediate-level discussion of managing a maturing MongoDB ecosystem. Charity is in charge of Parse's data and database operations and gave a high-level overview of the sorts of strategies and scripts her team runs to keep Parse running. This was a deeper talk than I anticipated (several people in the audience were nodding their heads in agreement at certain points, suggesting to me that they were DBAs managing Mongo systems in their own groups), so I tried to get a general sense of the gotcha issues in running a production deployment. I doubt these issues differ greatly from running a production SQL Server or Oracle environment, but the key takeaway is that any production deployment is going to require a dedicated team (one or more persons) to monitor and maintain the database. Charity talked about the strategies she's learned over time, the rough performance and capacity thresholds to watch for, and the sorts of scripts and tools to have in hand before banging into a major downtime problem; a sketch of the flavor of script she meant follows below. She reiterated throughout that this is more art than science, often particular to the sort of application or service being deployed. It's clear that these personnel, once hired and proven good, should be nurtured in an organization; it would be painful to lose that institutional knowledge.
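
To give a flavor of the kind of check she was talking about, here's a minimal sketch of a replication-lag monitor runnable from the mongo shell. This is my own illustration, not Parse's actual tooling, and the threshold is arbitrary:

    // My own illustration, not Parse's tooling: flag secondaries that
    // have fallen too far behind the primary. Run from the mongo shell
    // (e.g., on a schedule) against any replica set member.
    var LAG_THRESHOLD_SECS = 60;  // arbitrary threshold for the example

    var status = rs.status();
    var primaryOptime = null;
    status.members.forEach(function (m) {
      if (m.stateStr === "PRIMARY") primaryOptime = m.optimeDate;
    });

    status.members.forEach(function (m) {
      if (m.stateStr === "SECONDARY" && primaryOptime) {
        // Subtracting the two Dates yields milliseconds of lag.
        var lagSecs = (primaryOptime - m.optimeDate) / 1000;
        if (lagSecs > LAG_THRESHOLD_SECS) {
          print("WARNING: " + m.name + " is " + lagSecs + "s behind the primary");
        }
      }
    });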

As an aside, Charity mentioned Facebook's purchase of Parse. Her demeanor, small joke, and giggle seemed to imply there was less than overwhelming happiness at Parse regarding the buyout. Interesting.

Next I attended a talk by Jason Zucchetto of 10gen on Mongo schema design. I was particularly interested in this talk since I've been building data models against MongoDB, using the Mongoose schema/driver layer to model data for a Hyperfine project, and one of my concerns has been what the best practices are for defining a schema and adapting it over the product life cycle. Jason's talk confirmed most of my assumptions, the base of which is that with Mongo, and NoSQL databases in general, schema design is less a waterfall approach than an agile one: figure out your anticipated usage patterns and define your schema from there. While that doesn't sound radically different from a SQL approach, the difference lies primarily in the use of denormalized schema. Under Mongo, it's important to let go of normalization and to use data duplication strategically, in anticipation of the sorts of queries and data presentation the application will rely on. Mongo is very flexible in how queries can span embedded documents and arrays holding one-to-many or many-to-many relationships. It always comes down to doing the right thing for the application, but much of the time, normalized data is the enemy of performance and flexibility. Jason pointed out that in many RDBMS scenarios, populating a web page may take several queries, whereas a properly defined Mongo schema can return the data in one or two gulps. Reflecting back on Eliot's and Charity's talks, Mongo's profiler tools can be a huge help in identifying choke points. Again, this isn't much different from the tools SQL databases offer; the difference is the approach.
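
To make the denormalization point concrete, here's a minimal sketch of the kind of Mongoose schema I've been describing, with comments embedded directly in their parent post rather than normalized into a separate collection. The model names and connection string are hypothetical, not from Jason's talk:

    var mongoose = require('mongoose');
    var Schema = mongoose.Schema;

    mongoose.connect('mongodb://localhost/blog');  // placeholder connection string

    // Comments live inside the post document instead of a separate,
    // normalized collection, so rendering a post page is one query.
    var commentSchema = new Schema({
      author:  String,   // denormalized copy of the commenter's display name
      body:    String,
      created: { type: Date, default: Date.now }
    });

    var postSchema = new Schema({
      title:    String,
      body:     String,
      tags:     [String],        // arrays are directly queryable
      comments: [commentSchema]  // one-to-many embedded in the document
    });

    var Post = mongoose.model('Post', postSchema);

    // Everything needed for the page arrives in a single gulp.
    Post.findOne({ title: 'Hello, Mongo' }, function (err, post) {
      if (err) return console.error(err);
      if (!post) return console.log('no such post');
      console.log(post.title + ': ' + post.comments.length + ' comments');
    });

A normalized design would need a second query (or a join-like fan-out) to pull in the comments; here they ride along with the post.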

I also attended a talk by Achille Brighton of 10gen covering the basics of replication in MongoDB. Replication lets you establish a primary database node in a cluster, with one or more secondaries that duplicate the data. A voting algorithm within the cluster identifies the primary, with heartbeats watching for nodes that go down and elevating a secondary when the primary fails. The Mongo tools appear to make configuring replica sets easy, with configurations defined in JSON. Replication is used not only for robustness: it can also satisfy geographic distribution, giving fast reads against local servers, and it allows data analysis to run against secondary nodes without affecting the main application's performance. Replication and sharding often go hand in hand, but I was not able to attend any sharding strategy talks.
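
For reference, standing up a replica set from the mongo shell looks roughly like the sketch below; the set name and host names are placeholders of my own, not from Achille's slides:

    // Run from the mongo shell against the intended first member.
    // The JSON configuration names the set and lists its members.
    rs.initiate({
      _id: "rs0",
      members: [
        { _id: 0, host: "db1.example.com:27017" },
        { _id: 1, host: "db2.example.com:27017" },
        { _id: 2, host: "db3.example.com:27017" }
      ]
    });

    // Shows which member the election chose as primary, plus the
    // heartbeat/health state of every member.
    rs.status();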

Max Schireson, the CEO of 10gen, gave a talk on indexing and query optimization. I've read online posts from people struggling with MongoDB performance, often dismissing Mongo out of hand over the issues they encountered, but I've been suspicious of those claims, just as I would be of novices discussing SQL Server performance issues. Max pointed out that the difference between a query that takes 2 seconds and one that takes 20 milliseconds is often a matter of a properly defined index. He noted that some index selection issues can be subtle but are nonetheless critical to get right, and that using the profiling tools to identify these issues is important. Mongo internally attempts to identify the best query plan, and these algorithms are continually being improved in later releases. It's clear to me that ad hoc queries should be avoided in application development, with specific data access methods written within the schema models so that performance knowledge is contained within one area of the code and not scattered throughout the codebase. I've been doing this with Hyperfine code; the pattern is similar to writing well-tuned stored procedures in SQL with language-based data access layers.
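
To illustrate the 2-seconds-versus-20-milliseconds point, here's the sort of before-and-after check the shell makes easy; the collection and field names are my own example, not Max's:

    // Before: with no index on "tags", explain() reports a full
    // collection scan ("cursor": "BasicCursor" in the 2.4-era output).
    db.posts.find({ tags: "mongodb" }).explain();

    // Add an ascending index on the queried field.
    db.posts.ensureIndex({ tags: 1 });

    // After: the same query now walks the index instead
    // ("cursor": "BtreeCursor tags_1"), with far fewer scanned objects.
    db.posts.find({ tags: "mongodb" }).explain();

In application code, I keep the corresponding query wrapped in a single schema-level access method so the index assumption lives right next to the query that depends on it.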

Of the pre-session conversations I had, the most interesting was with Will Shulman, the CEO and co-founder of MongoLab. Hyperfine is using MongoLab to host its MongoDB instances, and I had a fun discussion with Will about some of the connectivity issues I've encountered hosting on Azure, as well as his general computer science background from Stanford. I also talked with Joo Mi Kim, MongoLab's VP of finance and operations. Next to MongoLab's booth was Microsoft's Open Technologies group. I spoke with the evangelist manning that booth and brought up an issue Hyperfine and some of its clients have had in being bound to Windows machines for Node.js deployments. The tool at the center of this, cspack.exe, is used by both Visual Studio and the Azure PowerShell tools for packaging up a deployment. I was amazed to hear the evangelist say he had never heard of it. He asked me to send him an email about the issue, so I did. If I hear anything useful back from him, I'll pass it along to interested parties.

I had lunch with a team from a San Francisco tech company who seemed skeptical of Mongo. As their chief developer put it, "It seems useful for some scenarios," but that statement struck me as a tautology without much insight. I talked with them about some of the things I've discussed with folks on this email distribution: agility, JavaScript-centric architectural approaches, advantages for operations, institutional knowledge, and adaptability. I think that may have opened their eyes a bit further to the possibilities. It was very interesting to hear the almost complete disregard for all things Microsoft within this group, an attitude they attributed to much of the Valley. Microsoft and its technologies seem to be second cousins there; I saw this in my visit to Stanford as well.

Included in the conference swag bag was a copy of "The Little MongoDB Book," which can also be found at https://github.com/karlseguin/the-little-mongodb-book. It offers a great overview of MongoDB and can be read in an afternoon.

Finally, for completeness' sake, I'll include a link to a forum thread sent to me by a former colleague of mine from Microsoft in the '90s. Both of us are veterans (and victims) of the architectural-purity-versus-rapid-deployment battles, which no one but Windows NT Cairo team members will remember. The discussion covers relational databases versus NoSQL databases. I think some of it misses the point, but it's an interesting discussion nonetheless: https://news.ycombinator.com/item?id=5696451

-- Rob Bearman
