Griddle Noise

A year or more ago, I was really struggling with zc.buildout, a Python based tool for building out "repeatable" deployments. Buildout makes setuptools actually usable, particularly for development and deployment of web apps, although there are many other uses.

Buildout keeps everything local, allowing one app to use version 3.4.2 of one package while another app can use 3.5.2. But more than just being an 'egg' / Python package manager, it can do other tasks as well - local builds of tools (from libxml to MySQL and more), again allowing one app to build and use MySQL 5.0.x and another app to use 5.1.x; or just allowing an app to be installed onto a new box and get everything it needs, from web server to RDBMS to Memcached and beyond. We don't use all of these features (yet), but it's a nice dream.

Already it's very nice to be able to make a git clone of a customer app, run buildout, and then start it up. Buildout will put setuptools to work to ensure that proper versions of dependent components are installed (and, quite nicely, it's very easy to share both a download cache and a collection of 'installed eggs' - multiple versions living side by side, with individual buildouts picking the one they desire).

But it was not easy to get to this golden land. Prior to using Buildout, we'd check our code out of our CVS repository. Our customer apps were just another Python package, nothing special (not an application, and - more importantly - not packaged up in 'distutils' style). As we started to make more and more reusable parts, we had to do a lot of checkouts; and so I wrote a tool to help automate this checkout process. It would also check out other third party code from public Subversion repositories; all because it was easier to check out a particular tag of 'SQLAlchemy' or 'zc.table' than to try to install them into a classic-style Zope 3 'instance home'.

But it was getting harder and harder to keep up with other packages. We couldn't follow dependencies in this way, for one thing; and it required some deep knowledge of some public SVN repository layouts in order to get particular revision numbers or tags.

'Buildout' promised to change all of that, and offer us the chance to use real, honest-to-goodness distributed Python packages/eggs. But getting there was so very hard when there are deadlines beating you down.

I took a lot of my frustration out on both Setuptools (which is so goddamn woefully incomplete) and Buildout. But the fault was really in ourselves... at least, in a way. As mentioned above, it was easier to just checkout 'mypackage' into$INSTANCE_HOME/lib/python/mypackage than to figure out the install options for distutils/setuptools. As such, NONE of our code was in the Python 'distutils' style. We put some new packages into that style, but would still just check out a sub-path explicitly with CVS just like we were doing with public SVN code.

Part of the big problem that we had which made it so difficult was that we had hung onto CVS for, perhaps, too long. And doing massive file and directory restructuring with CVS is too painful to contemplate. But moving to Subversion never seemed worth the effort, and so we held on to CVS. But I knew I'd have to restructure the code someday.

Fortunately, Git arrived. Well, it had been there for a while; but it was maturing and quite fascinating and it offered us a chance to leapfrog over SVN and into proper source code management. Git is an amazing tool (perhaps made more so by being chained to CVS for so long), and it provided me with the opportunities to really restructure our code, including ripping apart single top-level packages into multiple namespaced packages (ie - instead of 'example' being the root node with 'core' and 'kickass' subpackages, I could split that into 'example.core' and 'example.kickass' as separate packages and Git repositories while keeping full histories).

For a while, I used Git with its cvsimport and cvsexportcommit tools to clean up some of our wayward branches in CVS, while starting to play with Buildout. I was still struggling to get a Zope 3 site up and running using our frameworks. And here... well, the fault was partly in ourselves for having to go through fire to get our code into acceptable 'distutils' style packages, which made learning Buildout all the more hard. But the available documentation (comprehensive, but in long doctest style documents) for some of the Zope 3 related recipes was very difficult to follow. Hell - just knowing which recipes to use was difficult!

But after many months of frustrated half-attempts, often beaten down by other pressures, I opened a few different tabs for different core Buildout recipes in my browser and furiously fought through them all... And boom! Got something working!

Unfortunately it was one of those processes where by the time I got out of the tunnel, I had no idea how exactly I had made it through. One of my big complaints as I was struggling was the lack of additional information, stories of struggle and triumph, etc. And there I was - unable to share much myself! I can't even remember when I was able to break through. It's been quite a few months. Just a couple of weeks ago we deployed our last major old customer on this new setup; and we can't imagine working any other way now.

'Git' and 'Buildout' have both been incredibly empowering. What was most difficult, for us, was that it was very difficult to make the move in small steps. Once we started having proper distutils style packages in Git, they couldn't be cloned into an instance home as a basic Python package (ie, we couldn't do the equivalent of cvs checkout -d mypackage Packages/mypackage/src/mypackage and get just that subdirectory). And we couldn't easily make distributions of our core packages and use them in a classic Zope 3 style instance home (I did come up with a solution that used virtualenv to mix and match the two worlds, but I don't think it was ever put to use in production).

So it was a long and hard road, but the payoffs were nearly immediate: we could start using more community components (and there are some terrific components/packages available for Zope 3); we could more easily use other Python packages as well (no need to have some custom trick to install ezPyCrypto, or be surprised when we deploy onto a new server and realize that we forgot some common packages). Moving customers to new server boxes was much easier, particularly for the smaller customers. And we can update customer apps to new versions with greater confidence than before when we might just try to 'cvs up' from a high location and hope everything updated OK (and who knows what versions would actually come out the other end). Now a customer deployment is a single Git package - everything else is supplied as fully packaged distributions. It's now very hard to 'break the build' as all of the components that are NOT specific to that customer have to come from a software release, which requires a very explicit action.

Labels: buildout, dvcs, git, python, scm, zope3

I've seen some posts floating around about what a distributed version control system might give you. For me, these are the key elements:

Committing changes is separate from sharing. While the phrase "you can edit on a plane!" gets thrown around quite frequently, I think this is the far more important aspect of that vision. As a developer, you don't always know how a particular path of development might impact the code base. With a purely centralized system, you have to think first about where a path may be taking you as it could affect everybody else. Time and time again, I've seen developers work without any revision control safety net for days or weeks at a time because they don't want to "break the build", and they don't have time to look up the policies for branch naming, merging, etc. Not that such a thing should take a long time, but when under pressure, it's the last thing one wants to deal with. And the untracked changes keep getting bigger, and bigger. And when I say "I've seen developers..." here, I include myself.

With a DVCS, I can commit immediately. I took heavy advantage of this on a recent project where the set of work for which I was responsible was in no state to be set up on other systems. It required new configuration and possibly new tool installations and I didn't have time to help everyone else install and update their sandboxes. They didn't need my code anyways. Instead, I was able to pull in and rebase my work on updates from my co-workers while my personal branch was in development. When my code was mature, it was easy to merge into a more centralized branch. Very easy. In fact, it was just a fast-forward (in Git parlance), since I was rebasing my changes on top of those by my colleagues.

Again, this is possible with a purely centralized system, but it would have required me to realize the significance of my changes and their impact on others. My sandbox was in a "guts on the table" kind of state. Just about every commit I made was stable, but sharing them would have made it harder on other team members to do their work due to the changes made in tool and configuration dependencies.

In essence, a DVCS inverts the control back to individuals. As a developer, I can commit my changes whenever I want. With a purely centralized system, I have to think more about what I'm about to commit, since it immediately impacts all other developers.

A DVCS encourages experimentation. Being able to commit my changes whenever I want and being able to make local branches so easily makes it easier for me to start playing around with new ideas. Whether that's doing a big refactoring or restructuring of code, experimenting with a new feature of third party library, or working on updating the code to run well on the latest release of Foo, I know that I can experiment and SAVE those experiments in small chunks, without impacting anybody else. I can choose when and how I want to share my results. I can choose to throw my experiment away and not have to be reminded by its grizzly branch name for years to come.

We have an internal toolkit that bridges SQLAlchemy and Zope 3 and provides various other useful features and integrations. Late last year, I started looking into updating the code to work with SQLAlchemy 0.4 and to also clean up some ancient hacks. We were still using CVS at the time. I can't remember when I made the branch, but I knew at some point that I was heading down a potentially long path and that a branch would be required. Other priorities were coming up and I'd have to leave this work aside for a while.

Well, there's 2-3 days of feverish work, all held hostage within a single check-in. I've been wanting to pull some features out of that branch and into the current mainline branch, but since it's all in one big check-in, I can't do that easily. This is the second time I've done that (the other time was on a feature branch that lead to a dead end, but along the way I had some good ideas that I'd love to be able to extract now).

If I'd been using Git, I would have been making more commits, more frequently, and in much smaller sizes. Using Git, I would then be able to quite easily cherry pick individual commits out of that branch and apply them to the mainline.

Finally, a DVCS makes it easy to vet changes through the system. We don't have to give new employees the keys to the kingdom, particularly when their skill set is focused on a specific area. Instead, the code can go through review channels. They make commits in their local repository, and tell someone (like me) that they've made some changes. Using Git (or any other tool - but Git's named remotes makes this hella easy), I can pull in changes from their repository, verify that they're good, and push them to the canonical repository. I can merge them into other branches, if required.

Imagine this in a team situation: team members can share their repositories with each other as needed, giving each other chances to do code reviews and fixes before sharing those changes with the larger group or division; all without requiring permission to touch the central repository. Suddenly whole new workflows open up, based on the "networks of trust" inherent in all of us: a team leader collects commits from their team, and shares those changes with other team leaders. Those team leaders pull together changes from all of their teams (while sharing said changes across team lines) and push those on to a QA / Testing division. The QA / Testing division then puts their seal of approval on things by being the ones who control pushing to the "canonical" repository from which builds are based.

There's just so much more that can be done with a DVCS, and we're in an age now where there are very usable and useful tools for this job. A DVCS restores individual responsibility, encourages experimentation, enables adaptive workflows, and I believe it fits more naturally into how we humans organize our interactions. Whether this is in a rigidly defined corporate structure or a loosely connected set of worldwide open source contributors, the peer to peer nature combined with getting the whole repository enables people to step up and do bold things without having to go through channels to get any coveted "write access".

Labels: dvcs, git, mercurial, scm

Griddle Noise

About Me

Links

Previous Posts

Archives