This is archived from chewbranca.github.io. Old metadata:
- layout: post
- title: "Rewriting the CouchDB HTTP layer"
- date: 2014-08-17 10:19:23
- categories: tech
- tags: erlang couchdb http chttpd2
With the light at the end of the tunnel on the BigCouch merge, I thought it was time to get the conversation going on cleaning up the current HTTP stack duality. We've got a good opportunity to do some major cleanup, remove duplication, and start separating the various components of CouchDB more clearly.
Primary objectives
* Consolidate down to one HTTP layer
* Isolate HTTP functionality
* Separate HTTP server from HTTP resources
* Easy plugin integration
* Build clustered/local API
Consolidate down to one HTTP layer
We currently have two HTTP layers, couch_httpd and chttpd. This was a useful construct when BigCouch was a separate application, where isolating the clustered layer from the local layer was necessary and quite useful.
This is no longer the case, and we can significantly reduce code duplication by consolidating down to one HTTP layer. There are a number of places in the two apps where the code is nearly identical, except one calls out to fabric and the other calls out to couch_*. For instance, compare couch_httpd_db:couch_doc_open/4 [1] with chttpd_db:couch_doc_open/4 [2]. These are completely identical aside from whether the call goes through the clustered layer, fabric, or through the local layer, couch_db.
There are plenty of other places with similar duplication. This is obviously ripe for refactoring, and for introducing some higher level abstractions so that the HTTP layer functions independently of the document/database level APIs.
Isolate HTTP functionality
I don't think couch_doc_open/4 has any business existing in the HTTP layer; we should move all non-HTTP logic out. IMO the HTTP layer should only concern itself with:
1. Receiving the HTTP requests
2. Extracting the request data into a standard data structure
3. Dispatching requests to the appropriate internal APIs
4. Forwarding the response
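To make those four steps concrete, here is a minimal sketch of what such a thin layer could look like. Everything in it is hypothetical: the module name, the map used as the "standard data structure", and the Api fun standing in for an internal API are assumptions for illustration, not code from CouchDB or chttpd2.

```erlang
-module(http_layer_sketch).
-export([handle/2]).

%% Hypothetical sketch: none of these names exist in CouchDB.
%% RawReq is a map standing in for whatever the HTTP server hands us (step 1).
%% Api is a fun standing in for an internal, non-HTTP API.
handle(RawReq, Api) ->
    Req = extract(RawReq),
    Result = dispatch(Req, Api),
    respond(Result).

%% Step 2: turn the raw request into one standard structure.
extract(#{method := Method, path := Path} = RawReq) ->
    #{method => Method,
      path => Path,
      body => maps:get(body, RawReq, <<>>),
      options => maps:get(query, RawReq, [])}.

%% Step 3: hand the request to an internal API; no database logic lives here.
dispatch(#{method := Method, path := Path, body := Body, options := Opts}, Api) ->
    Api(Method, Path, Body, Opts).

%% Step 4: translate the API result into a status code and body.
respond({ok, Body}) -> {200, Body};
respond({error, not_found}) -> {404, <<"not_found">>};
respond({error, _Other}) -> {500, <<"internal_error">>}.
```

The point of the shape is that handle/2 never mentions documents, shards, or databases; swapping the Api fun is enough to point the same HTTP code at a different backend.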
Anything that doesn't fit into those four steps should be ripped out and moved elsewhere. For instance, the primary logic for determining the database redundancy and shard values is done in chttpd_db [3]. I would greatly prefer to see this logic in a database API.
The more we can isolate HTTP logic from database logic the better. Once they are fully decoupled, the HTTP layer is merely one particular client interface on top of the core database. We also get all the benefits of isolation for testing and whatnot.
Along these lines, I think we greatly overuse the #httpd{} record for passing around request data. Instead, we should extract the body and then combine all of the user-supplied headers and query string params into a standard options list. This way we can completely separate making database requests from the representation of the client request.
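As a rough illustration of the options list idea, the sketch below merges headers and query string params into one proplist. The helper name, the key normalization, and the rule that query params win over headers are all assumptions made for the example, not existing CouchDB behavior.

```erlang
-module(req_options_sketch).
-export([to_options/2]).

%% Hypothetical helper, not part of CouchDB: merge user-supplied headers and
%% query string params into a single options proplist. Query params win when
%% the same key appears in both (an arbitrary choice for this sketch).
to_options(Headers, QueryParams) ->
    Normalized = [{normalize(K), V} || {K, V} <- QueryParams ++ Headers],
    dedupe(Normalized, []).

%% Lowercase keys and turn '-' into '_' so "If-None-Match" and
%% "if_none_match" refer to the same option.
normalize(Key) ->
    [case C of $- -> $_; _ -> C end || C <- string:to_lower(Key)].

%% Keep the first occurrence of each key, preserving order.
dedupe([], Acc) ->
    lists:reverse(Acc);
dedupe([{K, V} | Rest], Acc) ->
    case lists:keymember(K, 1, Acc) of
        true -> dedupe(Rest, Acc);
        false -> dedupe(Rest, [{K, V} | Acc])
    end.
```

Code further down the stack would then only ever see the body plus this list, and would not need to know the request arrived over HTTP at all.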
Separate HTTP server from HTTP resources
I think everything I've said so far is pretty clear cut as the logical thing to do, but separating the HTTP server from the HTTP endpoints is less clearly defined. However, we do have precedent for this, and there are a number of solid benefits.
First, let me explain what I mean here. There are two pieces to an HTTP stack. First, there's the core HTTP engine that handles receiving and responding to requests and other things along those lines. Second, there are the places where you supply your business logic and figure out what content to send to the user.
CouchDB has a handful of places using this approach, where instead of defining all the logic in the HTTP stack directly, we have auxiliary modules defined within the appropriate applications that specify how any HTTP requests for that application are handled. A good clean example of this approach is couch_mrview_http [4].
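The split between server and resource can be sketched in a few lines. This is an illustrative toy, not how couch_mrview_http is wired up: the callback names allowed_methods/0 and to_json/1 are assumptions, and both halves live in one module only so the example compiles on its own.

```erlang
-module(resource_sketch).
-export([serve/3, allowed_methods/0, to_json/1]).

%% Server side: knows only the callback contract, not the business logic.
%% Resource is the name of any module exporting the two callbacks below.
serve(Resource, Method, Req) ->
    case lists:member(Method, Resource:allowed_methods()) of
        true -> {200, Resource:to_json(Req)};
        false -> {405, <<"method_not_allowed">>}
    end.

%% Resource side: in a real layout these callbacks would live in a module
%% inside the application that owns the endpoint.
allowed_methods() ->
    ['GET'].

to_json(_Req) ->
    <<"{\"couchdb\":\"Welcome\"}">>.
```

With this shape, adding an endpoint means writing a new resource module; serve/3 never changes.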
Easy plugin integration
One big advantage of the above separation of HTTP resources is that it provides a standard way for plugins to hook in new HTTP endpoints. The more we can treat the "core" CouchDB applications as plugins, the easier it is to isolate and replace various parts of the stack.
Build clustered/local API
The above example of couch_doc_open/4 is a clear cut case where we want to abstract the process of loading a document. Not all places are as easily abstractable, but this is a great example of why I think we should have a standard API on top of the clustered and local layers, where deciding which to use is based on a local/clustered flag, or some other heuristic.
I've been toying around with the idea of making a request object of some sort, say something like couch_req:make(ReqBody, ReqOptions), that you can then pass to couch_doc_api or some such, but I don't have any strong opinions on this.
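One possible reading of that idea, sketched under heavy assumptions: a request object built from a body and an options list, plus a single open_doc entry point that picks the clustered or local backend from a mode flag. Neither couch_req nor couch_doc_api exists as written here, the mode option is invented, and the backends are passed in as funs so the sketch runs without fabric or couch_db.

```erlang
-module(doc_api_sketch).
-export([make_req/2, open_doc/3]).

%% Hypothetical stand-in for the couch_req:make/2 idea: a request is just
%% the body plus the standard options list.
make_req(Body, Options) ->
    #{body => Body, options => Options}.

%% One entry point for loading a document. Backends maps a mode atom
%% (clustered | local) to a fun(DocId, Options); in a real system these
%% would wrap fabric and couch_db calls respectively.
open_doc(#{options := Options}, DocId, Backends) ->
    %% Assumed default: go through the clustered layer unless told otherwise.
    Mode = proplists:get_value(mode, Options, clustered),
    Open = maps:get(Mode, Backends),
    Open(DocId, proplists:delete(mode, Options)).
```

The caller's code is identical for both layers; only the flag in the options list differs, which is what would let the duplicated couch_doc_open/4 functions collapse into one.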
Where I've gotten so far: chttpd2, a proof of concept
I've hacked out an experimental WebMachine [5] based rewrite of the HTTP stack called chttpd2 [6]. This PoC follows the same ideas I've outlined above, so I'll run back through the previously outlined items and explain how chttpd2 handles each one.
Consolidate down to one HTTP layer
Right now I'm not doing anything special here. I still think building an API layer that decides whether to make a clustered or local request is the proper approach, so I've not included any logic in the HTTP stack for doing so.
Isolate HTTP functionality
I've got a solid separation of functionality in chttpd2. If you look at the current codebase in [6], there is zero logic for actually handling any particular CouchDB requests. Rather, those are self contained within the appropriate sub applications. I've started this for couchdb-couch [7] and couchdb-config [8]. Here's a simple example of the new welcome resource [9].
As you can see, there is zero database logic in the welcome request module. In fact, I started moving all the random logic in the current HTTP layer to a temporary module I'm calling couch_api [10]. As you can see from that module, it pulls out all the logic that was previously nested in couch_httpd_misc_handlers [11]. More complicated examples for creating a database and viewing database info are in [12], and an all dbs example is in [13]. Also, I've done similar things for couchdb-config, as mentioned above in [8].
Easy plugin integration
As I mentioned above, by making it easy to plug in new HTTP endpoints, we also make it easier for plugins to do the same. On that front I've made it so each application can optionally declare a couch_dispatch function describing what endpoints it can handle, and then chttpd2 will go and find all of those to figure out how to dispatch requests [14]. For example, here's how the couchdb-couch endpoints are declared [15].
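The discovery step can be sketched roughly as follows. This is not the code in [14] or [15]: the arity of couch_dispatch, the {PathSegment, HandlerModule} route shape, and the collect/route function names are all assumptions chosen to keep the example small.

```erlang
-module(dispatch_sketch).
-export([collect/1, route/2, couch_dispatch/0]).

%% Assumed shape, not the actual chttpd2 format: a module may export
%% couch_dispatch/0 returning [{PathSegment, HandlerModule}].
couch_dispatch() ->
    [{"_welcome", dispatch_sketch}].

%% Gather routes from every module that chose to export couch_dispatch/0;
%% modules without it are skipped, which is what makes it optional.
collect(Modules) ->
    lists:append([M:couch_dispatch() || M <- Modules, declares_dispatch(M)]).

declares_dispatch(M) ->
    %% function_exported/3 only sees loaded modules, so load first.
    _ = code:ensure_loaded(M),
    erlang:function_exported(M, couch_dispatch, 0).

%% Look up the handler for the first path segment of a request.
route(Segment, Routes) ->
    case lists:keyfind(Segment, 1, Routes) of
        {Segment, Handler} -> {ok, Handler};
        false -> {error, not_found}
    end.
```

A plugin would take part simply by exporting the same function; nothing in the HTTP application has to be edited to learn about it.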
Build clustered/local API
I have not started on this front, and have only built these endpoints for interacting with the clustered layer for simplicity, as this is just a proof of concept I hacked together. However, as I mentioned above, I've started moving all the logic out of the HTTP layer into more appropriate places. I've made similar changes to couchdb-config by moving all of the logic from [16] into the couchdb-config application itself.
Why WebMachine?
I find WebMachine [5] to be one of the more interesting HTTP stacks for building webapps. In particular, I like how they have a specific flow chart [17], where each coordinate point corresponds to a particular clause of the webmachine_decision_core:decision/1 function [18].
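To give a feel for that style, here is a toy decision graph written the same way. It is not WebMachine's code: it covers only the first four nodes of the flow chart, and the resource is modelled as a map of zero-arity funs with made-up callback names rather than a real WebMachine resource module.

```erlang
-module(decision_sketch).
-export([run/1]).

%% Toy decision graph, NOT WebMachine's implementation. Each clause is named
%% after a flow chart coordinate and either moves on to the next node or
%% halts with a status code. Resource is a map of zero-arity callback funs.
run(Resource) ->
    decision(b13, Resource).

decision(b13, R) -> test(call(service_available, R), b12, 503, R);
decision(b12, R) -> test(call(known_method, R), b11, 501, R);
decision(b11, R) -> test(not call(uri_too_long, R), b10, 414, R);
decision(b10, R) -> test(call(method_allowed, R), done, 405, R);
decision(done, _R) -> 200.

%% Follow the "true" edge to the next node, or stop with the failure code.
test(true, Next, _Code, R) -> decision(Next, R);
test(false, _Next, Code, _R) -> Code.

call(Name, R) ->
    Fun = maps:get(Name, R),
    Fun().
```

Because every status code is produced at a named node, you can read the function top to bottom next to the diagram and see exactly why a request got the response it did.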
That said, I think Cowboy [19] has more momentum and might be a better long term project to tie ourselves to.
Also, if we decide to go the WebMachine route, we'll need to restructure a fair bit of the current HTTP layer, making a number of breaking changes. I'm a strong -1 for coercing WebMachine into the current haphazard CouchDB API. WebMachine is very opinionated on how you structure your API (for good reason!) and I think going against that is a mistake.
So if we wanted to just do a drop-in replacement of the current CouchDB API, then Cowboy is the way to go. Although one of these days we should clean up the HTTP API.
Conclusion
I hope this can start a good discussion on a game plan for the HTTP layer. Like I said, this is a proof of concept that I hacked out, so I'm not attached to the code or the use of WebMachine, but I do think it's a good representation of the ideas outlined above.
Looking forward to hearing your thoughts and comments!
Footnotes
[1] https://github.com/apache/couchdb-couch/blob/master/src/couch_httpd_db.erl#L805-L823
[2] https://github.com/apache/couchdb-chttpd/blob/master/src/chttpd_db.erl#L886-L904
[3] https://github.com/apache/couchdb-chttpd/blob/master/src/chttpd_db.erl#L203-L205
[4] https://github.com/apache/couchdb-couch-mrview/blob/master/src/couch_mrview_http.erl
[5] https://github.com/basho/webmachine
[6] https://github.com/chewbranca/chttpd2/tree/initial-branch
[7] https://github.com/apache/couchdb-couch/tree/2073-feature-webmachine-http-engine
[8] https://github.com/apache/couchdb-config/tree/2073-feature-webmachine-http-engine
[9] https://github.com/apache/couchdb-couch/blob/2073-feature-webmachine-http-engine/src/couchhttprwelcome.erl
[10] https://github.com/apache/couchdb-couch/blob/2073-feature-webmachine-http-engine/src/couch_api.erl
[11] https://github.com/apache/couchdb-couch/blob/master/src/couch_httpd_misc_handlers.erl#L32-L45
[12] https://github.com/apache/couchdb-couch/blob/2073-feature-webmachine-http-engine/src/couchhttprdb.erl
[13] https://github.com/apache/couchdb-couch/blob/2073-feature-webmachine-http-engine/src/couchhttprdbs.erl
[14] https://github.com/chewbranca/chttpd2/blob/initial-branch/src/chttpd2_config.erl#L26-L33
[15] https://github.com/apache/couchdb-couch/blob/2073-feature-webmachine-http-engine/src/couch.erl#L68-L73
[16] https://github.com/apache/couchdb-couch/blob/master/src/couch_httpd_misc_handlers.erl#L155-L249
[17] https://raw.githubusercontent.com/basho/webmachine/develop/docs/http-headers-status-v3.png
[18] https://github.com/basho/webmachine/blob/develop/src/webmachine_decision_core.erl#L158-L595
[19] https://github.com/ninenines/cowboy