10
Feb 10

Avoid using nulls in Scala

Scala’s handling of nulls mixed with implicit conversions is quite tricky. I learned this the hard way today and it took hours to figure out what was going on. At first I thought it was a bug, but then someone pointed out how implicit conversions affect null method arguments.

The bottom line is: DO NOT USE NULLs unless you are using Java libraries and have no choice. Use Option instead, with Some(...) or None.

The problem is best described with code…

  def checkNullOrEmpty(v:Seq[Any]):Boolean = {
    println("Class:"+v.getClass)
    return (v != null) && !v.isEmpty
  }

  case class Race(val event:String, val protocol:String) {
    println("Event:"+event+", protocol:"+protocol)
    assert(checkNullOrEmpty(event))
    assert(checkNullOrEmpty(protocol))
  }

  val t = new Race(null, null)

Pasting the above into REPL yields the following result…

Event:null, protocol:null
Class:class scala.collection.immutable.WrappedString
java.lang.NullPointerException
    at scala.Proxy$class.toString(Proxy.scala:29)
    at scala.collection.immutable.WrappedString.toString(WrappedString.scala:22)
    at scala.collection.immutable.StringLike$class.length(StringLike.scala:48)
    at scala.collection.immutable.WrappedString.length(WrappedString.scala:22)
    at scala.collection.IndexedSeqLike$class.isEmpty(IndexedSeqLike.scala:81)
    at scala.collection.immutable.WrappedString.isEmpty(WrappedString.scala:22)
    at .checkNullOrEmpty(<console>:6)
    .....

So why is an NPE thrown at the line return (v != null) && !v.isEmpty?

Let’s look further into the output. When a Race instance is created, the constructor values are initialized to null. Inside the constructor we print this out and verify that the values are in fact null. By the time we get to the checkNullOrEmpty method call, the class of v is WrappedString, and the object is no longer null. In Java a call to getClass would fail, as v would be null; in Scala it has been converted to a WrappedString.

This happens through Scala’s implicit conversions. The checkNullOrEmpty method expects a Seq, but Seq is not a supertype of String. When we create the Race instance and pass null for both event and protocol, they are still String-typed references that happen to be null. Using event or protocol as an argument to checkNullOrEmpty triggers an implicit conversion. Why? In Java the compilation would fail, since the Seq trait is not part of String’s inheritance hierarchy, but in Scala it succeeds, because Scala finds an implicit conversion method to convert String to WrappedString. This method is defined in the Predef object. Predef is imported by default into all Scala classes, and it extends the LowPriorityImplicits class, which defines this implicit conversion: implicit def wrapString(s: String): WrappedString = new WrappedString(s). So Scala decides that the best way to convert the String into a Seq[Any] is through this implicit conversion, and it wraps the null value in a WrappedString.
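
To see the conversion in isolation, here is a small REPL sketch (mine, not from the original session):

  val s: String = null
  val seq: Seq[Char] = s       // wrapString from LowPriorityImplicits kicks in here
  println(seq.getClass)        // prints: class scala.collection.immutable.WrappedString
  seq.isEmpty                  // throws NullPointerException: the wrapped String is null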

This causes two issues. First, the null check no longer works: the object is an instance of WrappedString, so (v != null) is true. Since that passes, the RHS of the && operator is evaluated, and !v.isEmpty throws the NPE, because the underlying wrapped String value is null.

I’m not sure whether this is a bug, a feature, or whether there is simply no real consensus on how null should be handled, but as you can see it causes issues and should be avoided by not using null and using Option instead. If you are using Java libraries that return nulls, wrapping the return value in an Option before proceeding any further is a good idea.
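
For completeness, here is a rough sketch of how the same checks might look with Option instead of null (my own variation, not the original code):

  case class Race(event: Option[String], protocol: Option[String]) {
    require(event.exists(!_.isEmpty), "event must be non-empty")
    require(protocol.exists(!_.isEmpty), "protocol must be non-empty")
  }

  Race(Some("100m"), Some("rfid"))   // fine
  Race(None, Some("rfid"))           // fails fast with a clear message instead of an NPE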


06
Feb 10

Implementing bloom filter with a murmur hash function

Recently I read a blog post by Jonathan Ellis about bloom filters. Jonathan works on Cassandra and so has lots of empirical recommendations on its implementation. Cassandra uses bloom filters extensively for performance optimization. A bloom filter is a space-efficient probabilistic data structure used to test whether an element belongs to a set. It’s similar to a standard hashtable, except that it’s space-efficient and doesn’t store the actual value; rather, it hashes the keys into a bit array. The reason it’s a probabilistic data structure is that it allows false positives, but not false negatives. This means that to answer the question of whether an element x belongs to a set S (x ∈ S), a bloom filter returns true or false. A false (doesn’t exist) is guaranteed to be accurate, but a true (exists) has some probability of being a false positive.

So why would one use such an algorithm? Say you store records on disk. Someone requests a particular record and you proceed to seek that record. This is usually an expensive operation for high-throughput systems or systems with limited resources, so before invoking such an expensive operation you can find out whether the record exists at all. If the record does exist, you can then proceed to retrieve it through the more intensive operation. Because there is a small probability of a false positive, you might still find (through the resource-intensive operation) that it doesn’t exist. So in an environment where you’re servicing many requests for records which might not exist, you can reduce the number of expensive operations and answer such requests in constant time, O(1).
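
To make that pattern concrete, here is a hypothetical sketch (Record and readRecordFromDisk are made up; BloomFilter is the class built later in this post):

  def findRecord(key: String, bloom: BloomFilter): Option[Record] = {
    if (!bloom.exists_?(key)) None     // definite miss: answered in memory, no disk seek
    else readRecordFromDisk(key)       // possible hit: may still turn out to be a false positive
  }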

Jonathan’s blog post provides some great information about how to implement a very effective bloom filter. One of the most important considerations is the hash function. To lower the probability of false positives, one must use a hash function which distributes the hashes evenly across the hash space. Jonathan recommends the murmur hash algorithm, one of the most efficient and effective hash functions, with great performance and a low collision rate.

Another thing done to reduce hash collisions, and in turn false positives, is that you don’t just set the bit for a single hash function result; rather, you do so several times. (5 times is referred to a lot in the literature and seems like a sweet spot.) This means that you take a key, calculate 5 hashes (using 5 different hash algorithms, or the single-algorithm strategy I’ll discuss below) and set the bit for each one of these hashes in the bit array. Answering the question of whether the key exists does the reverse: calculate the 5 hashes and check that they are all set in the bit array. If any of the 5 aren’t set, then you can be assured it doesn’t exist.

So let’s look at some code. Below is the implementation of bloom filter in Scala. It relies on a murmur hash implementation which I won’t list, but you can view/download it here.

  import scala.collection.mutable.BitSet

  class BloomFilter(capacity:Int, hashable:Hashable, hashCount:Int = 5) {

    private val buckets:BitSet = { new BitSet(capacity) }
    private val hashFunc = hashable.hashes(hashCount)(capacity) _

    def addValue(value:String) {
      hashFunc(value).foreach( buckets += _ )
    }

    def exists_?(value:String):Boolean = {
      for ( i <- hashFunc(value) ) if (!buckets.contains(i)) return false
      return true
    }
  }

  trait Hashable {
    def hashes(hashCount:Int)(max:Int)(value:String):Array[Int]
  }

  class MurmurHashable extends Hashable {
    import com.cobrio.algorithms.{MurmurHash => MH}
    def hashes(hashCount:Int)(max:Int)(value:String):Array[Int] = {
      val hash1 = MH.hash(value.getBytes, 0)
      val hash2 = MH.hash(value.getBytes, hash1)
      ( for ( i <- 0 until hashCount) yield Math.abs((hash1 + i * hash2) % max) ).toArray
    }
  }

The code above should be pretty self-explanatory, but let’s take a look at the hashing strategy. We calculate 5 hashes (the default) on the key being stored, although we only ever invoke the murmur algorithm twice; look at the two MH.hash calls in the hashes method above. Adam Kirsch and Michael Mitzenmacher wrote a paper titled Less Hashing, Same Performance…, which shows that a hashing technique which simulates additional hash functions beyond two can increase the performance of bloom filters without any significant loss in the false positive probability. To summarize the math in the paper, the formula is gi(x) = h1(x) + i*h2(x) mod m, where m is the number of buckets in the bloom filter, h1 and h2 are the two calculated hashes, and i ranges from 0 up to k – 1, where k is the number of hashes we want to generate.

Here is how you’d use the above bloom filter…

  val bloom = new BloomFilter(2000, new MurmurHashable())
  bloom.addValue("Ilya Sterin")
  bloom.addValue("Elijah Sterin")

  assert(bloom.exists_?("Ilya Sterin"))
  assert(bloom.exists_?("Elijah Sterin"))
  assert(!bloom.exists_?("Don't Exist"))

05
Feb 10

NOSQL Databases for web CRUD (CouchDB) – Shows/Views

There are many applications that easily lend themselves to the CRUD paradigm. Even if only 80% of an application’s functionality is pure CRUD, one can benefit from a simpler storage model. For so many years many (including myself) thought that storing application state for an enterprise-grade application meant we had one option: an RDBMS. Not that other models weren’t available, but their prevalence was not as high, which made one question the quality/stability and long-term health of such software. So we’ve grown accustomed to approaching state persistence by sticking everything into one hole. If it didn’t fit, we trimmed it, cut it, squeezed it, stomped on it, but we made it fit. Then when it was time to pull it out, ah, we repeated the procedure. ORMs are one of the most popular remedies for such procedures. But if you have ever developed a complex data model and actually taken the time to think about the data access strategy on both ends, application and RDBMS, you’d quickly run into many limitations of the ORM model. I guess you can abstract away the impedance mismatch only so much, but watch out for those leaky abstractions. So if you’re still rusty on your SQL and relational theory because the great gods promised that you’d never have to worry about it if you use an ORM, you’d better get to learning, unless you’re planning on maintaining a ToDo list application for the rest of your life.

So, with that out of the way, let’s talk about real data persistence. There are many applications (especially web applications) that don’t lend themselves very well to the relational persistence model. There are many reasons for this, but those who have ever had to beat their heads against the wall bending the relational model to persist their data know what I mean. By the time you’re done, you’re using an RDBMS to store non-relational data and all the benefits of the relational model are moot. You might as well store your data in an Excel spreadsheet. So what are some of these reasons?

  1. Highly dynamic structure (relational schemas are rather static, if you’re doing it the relational way (no tall/skinny tables))
  2. Data model is not very relational. That speaks for itself, but many don’t really know when and how to identify this, as we’ve spent so much time identifying relations that don’t exist or are irrelevant to the application.
  3. Your relational schema is denormalized to the point where you’re no longer benefiting from relational database features like enforcing consistency and reducing redundancy in the data.
  4. Your relational database is bending backwards to accommodate your read/write throughput even after you denormalized (which itself is a reason to look elsewhere) and optimized, forcing you to continuously have to scale up to allow for increased load.
  5. You continuously marshall/unmarshall data (ORM???) to persist it and then to transform it to another data format.

Touching a bit more on bullet #5: lots of software is written using good OO principles of encapsulation. Encapsulation is at the heart of software abstraction and is probably the most important principle, but people tend to abuse it. Abstractions are good when they add value, but marshalling and unmarshalling one data structure into another for no apparent reason, other than not having to learn how to deal with a particular data structure, is not a good trade. So many software projects use an ORM for the sake of not learning SQL, but how far can you actually get? ORM is a perfect example of a leaky abstraction. So many projects retrieve data from a web view in JSON or url-encoded format and marshall it into objects, only to validate the data and persist it using an ORM. So now you’ve unmarshalled the data from JSON into an object graph just to marshall it again into a SQL query to send to the database. Do we really need these superfluous steps?

I’m sure there are other reasons I haven’t mentioned here. These are the ones I personally faced when making my decision.

A rational way of deciding on data persistence is not to automatically start writing a DDL script or grab your ER diagram tool, but rather look at what data you have, how would this data persist in a “natural” way, how does the client software need to access this data, what are the performance/scalability considerations and then go out and look at different persistence models to find the best match.

In my latest project, I had to think about a way to persist hierarchical data. This data will be accessed through some web medium (browser, mobile client, etc…) the majority of the time. One of the web interfaces will be an Ajax-enabled web app, another will be an iPhone and/or Android app. JSON is the lingua franca of web communication. Some will argue it’s XML, but I’ll keep my XML opinions to myself at this point.

CouchDB is a document database which one could call a key/value store. It allows for storage of JSON documents that are uniquely identified by keys. Sounds interesting? Not really. There are tens of other databases with the same capabilities, so why CouchDB? Well, in one short sentence: CouchDB is built on the web and for the web. So what does that really mean? Well, besides the JSON storage structure and its innate ability to scale horizontally, it has some pretty awesome features that make it very appealing for a particular type of application. The task is to decide whether the application you’re building is that application. So as not to make this post any longer than it already is with my rant, let me describe and demonstrate some of the features that I’ve used over the last few days and why they are relevant.

Please make sure you have CouchDB version 0.10.* installed as well as the curl command line utility. For installation instructions see http://wiki.apache.org/couchdb/Installation

Once CouchDB is installed, you can start it using the couchdb command. Depending on your setup, you may need to run the following command…

sudo couchdb

A little bit of a background though before we get any further…

We’re going to store hierarchical data, which JSON is a natural fit for. One of the issues we have is that in our industry there are numerous data standards, and they are all defined either in XML or in some delimited rectangular format. One major use case involves performing CRUD operations on the data from a variety of sources (web app, mobile app, etc…) as well as being able to emit this data in one of the industry-standard formats for integration purposes.

CouchDB exposes a RESTful API, so it’s rather easy to use from any language which supports HTTP. Most popular languages have client libraries on top of that, to abstract away the raw HTTP calls. Here is a list of available clients: http://wiki.apache.org/couchdb/Basics. For our purposes we’re going to use curl, a command line utility which allows us to make HTTP requests. So let’s see how we can easily accomplish this with CouchDB.
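
The rest of this post sticks to curl, but since the API is plain HTTP, any language will do. As a rough illustration, here is a small Scala sketch using only the JDK (the URL and document name match the curl examples below):

  import java.net.{HttpURLConnection, URL}

  // PUT a JSON document into CouchDB and return the response body.
  def putDocument(url: String, json: String): String = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("PUT")
    conn.setDoOutput(true)
    conn.setRequestProperty("Content-Type", "application/json")
    conn.getOutputStream.write(json.getBytes("UTF-8"))
    val response = scala.io.Source.fromInputStream(conn.getInputStream).mkString
    conn.disconnect()
    response                           // e.g. {"ok":true,"id":"record1","rev":"..."}
  }

  putDocument("http://localhost:5984/sample_db/record1", """{"name": "John Doe"}""")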

Now that CouchDB is successfully running, let’s create a database and insert some sample data…

curl -X PUT "http://localhost:5984/sample_db"

The above line creates a database called sample_db. If the command is successful, you will see the following output: {"ok":true}. Now let’s add three documents to this database. The JSON data files which we’re sending below are found in the code snippets below, labeled accordingly, so make sure they are in the directory from which you’re running the commands.

curl -X PUT -d @rec1.json "http://localhost:5984/sample_db/record1"
curl -X PUT -d @rec2.json "http://localhost:5984/sample_db/record2"
curl -X PUT -d @rec3.json "http://localhost:5984/sample_db/record3"

Again, each command should yield a JSON response with "ok" set to true if the add succeeded. Here is what one would expect from the first command: {"ok":true,"id":"record1","rev":"1-7c15e9df17499c994439b5e3ab1951d2"}. Again, ok is set to true, making this a success response. The id field is set to the name of the record which we created. You can see that names are set through the URL, as documents are just resources in the world of REST. The rev field displays the revision of this document. CouchDB’s concurrency model is based on MVCC, so it versions documents as it updates them, and each document modification gets its own unique revision id. You can read more about this in CouchDB’s architecture and API documentation.

rec1.json

  {
    "name": "John Doe",
    "date": "2001-01-03T15:14:00-06:00",
    "children": [
      {"name": "Brian Doe", "age": 8, "gender": "Male"},
      {"name": "Katie Doe", "age": 15, "gender": "Female"}
    ]
  }

rec2.json

  {
    "name": "Ilya Sterin",
    "date": "2001-01-03T15:14:00-06:00",
    "children": [
      {"name": "Elijah Sterin", "age": 10, "gender": "Male"}
    ]
  }

rec3.json

  {
    "name": "Emily Smith",
    "date": "2001-01-03T15:14:00-06:00",
    "children": [
      {"name": "Mason Smith", "age": 3, "gender": "Male"},
      {"name": "Donald Smith", "age": 2, "gender": "Male"}
    ]
  }

Now that we have the data persisted, let’s talk about some strategies for getting the data out.

CouchDB supports views, which are used to query and report on the data stored in the database. Views can be permanent, meaning they are stored in CouchDB as named queries and are accessed through their name. Views can also be temporary, meaning they are executed and discarded. CouchDB computes and stores view indexes, so view operations are very efficient and can theoretically (and I believe practically) span remote nodes. Views are written as map/reduce operations, so they lend themselves well to distribution. Here is an example of a map function in a view. (Reduce functions are optional unless your query requires aggregating result sets.)

  function(doc) {
    if (doc.name == "Ilya Sterin") {
      emit(null, doc);
    }
  }

There are two other really cool features which allow more effective data filtering and transformation: shows and lists. The purpose of shows and lists is to render a JSON document in a different format. A show transforms a single document into another format. A show function is similar to a view function, but it takes two parameters, function(doc, req): doc is the document instance being processed and req is an abstraction over the CouchDB request object. Here is a simple show function…

  function(doc, req) {
    var person = <person />;
    person.@name = doc.name;
    person.@joined = doc.date;
    person.children = <children />;
    if (doc.children) {
      for each (var chldInst in doc.children) {
        var child = <child />;
        child.text()[0] = chldInst.name;
        child.@age = chldInst.age;
        child.@gender = chldInst.gender;
        person.children.appendChild(child);
      }
    }
    return {
      'body': person.toXMLString(),
      'headers': {
        'Content-Type': 'application/xml'
      }
    }
  }

The XML literals and inline expressions you see here are E4X, which adds native XML support to ECMAScript and is implemented in SpiderMonkey, the embedded JavaScript engine that CouchDB uses.

This show function takes a particular JSON record and turns it into XML. Creating a show is pretty simple: you just encapsulate the function above into a design document and create that document through a PUT.

Here is the design document for the show above…

xml_show.json

  {
    "shows": {
      "toxml": "Here you inline the show function above.  Make sure all double quotes are escaped..."
    }
  }

Once you have the design document, create it…

curl -X PUT -H "Content-Type: application/json" -d @xml_show.json "http://localhost:5984/sample_db/_design/shows"

Note: in (…./_design/shows), shows is just the name of the design document; you can call it whatever you want.

Now let’s invoke the show

curl -X GET "http://localhost:5984/sample_db/_design/shows/_show/toxml/record1"

Here is the output

<person name="John Doe" joined="2001-01-03T15:14:00-06:00">
  <children>
    <child age="8" gender="Male">Brian Doe</child>
    <child age="15" gender="undefined">Katie Doe</child>
  </children>
</person>

So, that was super easy. We stored our document, which required no code on our behalf, and then we retrieved it as XML with minimal effort using ECMAScript’s E4X facilities.

So how would I transform a record collection or view results into a different format? This is where lists come in. Lists are similar to shows, but they are applied to the results of an existing view. Here is a sample list function.

  function(head, req) {
    start({'headers': {'Content-Type': 'application/xml'}});
    var people = <people/>;
    var row;
    while (row = getRow()) {
      var doc = row.value;
      var person = <person />;
      person.@name = doc.name;
      person.@joined = doc.date;
      person.children = <children />;
      if (doc.children) {
        for each (var chldInst in doc.children) {
          var child = <child />;
          child.text()[0] = chldInst.name;
          child.@age = chldInst.age;
          child.@gender = chldInst.gender;
          person.children.appendChild(child);
        }
      }
      people.appendChild(person);
    }
    send(people.toXMLString());
  }

Again, you encapsulate this list function into a design document, along with a simple view function…

xml_list.json

  {
    "views": {
      "all": {
        "map": "function(doc) { emit(null, doc); }"
      }
    },
    "lists": {
      "toxml": "Here you inline the show function above.  Make sure all double quotes are escaped as it must be stringified due to the fact that JSON can't store a function type."
    }
  }

Now, we create the design document

curl -X PUT -H "Content-Type: application/json" -d @xml_list.json "http://localhost:5984/sample_db/_design/lists"

Once the design document is created, we can request our XML document listing all person records

curl -X GET http://localhost:5984/sample_db/_design/lists/_list/toxml/all

And the output is

  <people>
    <person name="John Doe" joined="2001-01-03T15:14:00-06:00">
      <children>
        <child age="8" gender="Male">Brian Doe</child>
        <child age="15" gender="Female">Katie Doe</child>
      </children>
    </person>
    <person name="Ilya Sterin" joined="2001-01-03T15:14:00-06:00">
      <children>
        <child age="10" gender="Male">Elijah Sterin</child>
      </children>
    </person>
    <person name="Emily Smith" joined="2001-01-03T15:14:00-06:00">
      <children>
        <child age="3" gender="Male">Mason Smith</child>
        <child age="2" gender="Male">Donald Smith</child>
      </children>
    </person>
  </people>

So you can see how shows and lists are very useful and provide a convenient way to transform documents and view results into different formats.

As you can see, we created the database and stored data. No code was required to make that happen; just collect the data through your application and make a CouchDB REST request. We also added some custom functionality for transforming the data for multiple types of client consumption by using shows and lists. In my opinion, CouchDB is a great step towards what one could call a web/cloud-scale database. It has awesome abilities to integrate with web technologies and it can scale to support the ever-increasing web scale of data. In other words, it fits some application models like a glove.

I’ve barely scraped the tip of the iceberg of what CouchDB can do. We haven’t talked about result aggregation, which can be achieved with map/reduce, and we also haven’t discussed data validation and security. These features might be the topic of some future posts.


03
Feb 10

The start of the Scala journey (concurrency and idiomatic Scala rant)

I’ve been following Scala off and on for about 2 years now, mostly in spurts. I liked the language, but due to workload and other priorities I never had the time to take it for a full ride. Well, over the last 2 weeks, I decided to take the full plunge. Full meaning I’m taking a highly concurrent production application which powers a very critical component of our system and rewriting it in Scala. I’m doing this for more than just fun. This application has grown from a very cleanly architected one to one that is still rather nicely designed, but has accumulated a lot of technical debt. With everything I’ve learned about Scala, I think I can redesign it to be cleaner, more concise, and probably more scalable. The other big reason I’m looking to give Scala a shot is its Actor-based concurrency. I’ve worked with Java’s threading primitives for many years and have accumulated a love/hate relationship. The Java SE 5 concurrency package brought some nice gems to my world, but it didn’t eliminate the fact that you’re still programming to the imperative model of shared-state synchronization. Scala actors hide some of the ugliness of thread synchronization, though they don’t eliminate the issue completely. Because Scala is a mix of imperative and functional language and actors are implemented as a library, nothing stops one from running into the same issues as with more primitive thread state-sharing operations (i.e. race conditions, lock contention, deadlocks/livelocks). Basically, if you’re using actors as just an abstraction layer over old practices, you’ll be in the same boat you started in with Java. With all that said, unlike Java, Scala provides you with the facilities for designing cleaner and more thread-safe systems thanks to its functional programming features. Mutable shared state is the leading cause of non-determinism in concurrent Java applications, so immutability and message passing are a way into a more deterministic world.

I’ve also looked at other concurrent programming models, like STM/MVCC. STM is the basis of concurrent programming in Clojure and it’s a different paradigm than actors. STM is a simpler model if you’re used to programming with the old imperative threading, as it abstracts you from concurrency primitives by forcing state modifications to occur in a transactional context. When this occurs, the STM system takes care of ensuring that the state modifications occur atomically and in isolation. In my opinion this model suits the multi-core paradigm very well and allows a smoother transition. The problem with it, at least in the context of MVCC, is that for each transaction and data structure being modified, a copy is made for the purposes of isolation (the implementation of copying is system dependent, some might be more efficient than others), and you can already see the issue: for a system that has to handle numerous concurrent transactions involving many objects, this can become a bottleneck, and the creation of copies can overburden the system’s memory and performance. There are some debates about that in the STM world, mostly involving finding the sweet spot for such systems, where the cost of MVCC is lower than the cost of constant synchronization through locking.

The actor model is different: it works in terms of isolated objects (actors), all working in isolation and communicating by message passing. No actor can modify or query the state of another, short of requesting such an operation by sending a message to that particular actor. In Scala you can break that model, since you can send around mutable objects, but if you are to really benefit from the actor model, you should probably avoid doing that. Actors lend themselves well to concurrent applications that not only span multiple cores but can also easily be scaled to multiple physical nodes. Because the messages being passed are immutable data structures that can be easily shared, the underlying actor system can pass these messages between physically dispersed actors just as it can between actors within the same physical memory space.
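
To make the message-passing idea concrete, here is a minimal sketch using the scala.actors library (the greeter actor and its Greet message are made up for illustration):

  import scala.actors.Actor._

  case class Greet(name: String)

  // An isolated actor: nothing touches its state directly, callers just send it messages.
  val greeter = actor {
    loop {
      react {
        case Greet(name) => println("Hello, " + name)
      }
    }
  }

  greeter ! Greet("Ilya")   // asynchronous, fire-and-forget message send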

So the world of concurrency is getting more exciting with these awesome paradigms. One thing to remember is that there is no one-size-fits-all concurrency model, and I don’t see any one of the above becoming the de-facto standard any time soon. There is a sweet spot for each, so one should learn the ins and outs of each model.

Now that I’ve got concurrency out of the way, let’s get back to the actual syntax of Scala. Scala is very powerful (at least compared to Java). This power comes with responsibility. You can use Scala to write beautiful, concise programs, or you can use it to write obscure, illegible programs that no one, including the original author, will be able to comprehend. Personally, I prefer and can responsibly handle this responsibility. I’m a long-time Perl programmer (from way before I started programming Java), and I’ve seen (and even written at times) programs that Larry Wall himself wouldn’t be able to comprehend.

Scala comes with operator overloading, and when not used judiciously, that power alone can be responsible for the illegibility of a system. This is one of the major reasons languages like Java decided not to include it. Personally, I think operator overloading can be a beautiful addition to any API. It can make writing DSLs easier and using them more natural. Again, this power is great in the hands of experienced and responsible programmers.
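
A tiny sketch of what judicious operator overloading can look like, using a made-up Money type:

  case class Money(cents: Long) {
    def +(other: Money): Money = Money(cents + other.cents)
    def *(times: Int): Money = Money(cents * times)
  }

  // Reads like arithmetic; * binds tighter than +, just as you'd expect.
  val total = Money(250) + Money(199) * 2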

After having experienced great power (Perl) and great restraint (Java), I’m leaning more towards power (who wouldn’t? :-). On one hand, it’s nice to be able to read and comprehend anyone’s Java program, even when it’s not nicely written; on the other hand, it’s a pain trying to write a program while jumping through all the hoops and limitations imposed by the various constraints. In a perfect AI world, the compiler would infer the capabilities of the programmer and restrict its facilities based on those, in some way so as to not offend anyone :-) So if a novice is inferred, ah, there goes the operator overloading and implicit conversions, etc… But for now, I’d rather have a powerful tool to use when I write software, and Scala seems to push the right buttons for me at this point.

I’m going to start off a series of posts, beginning with this one, about my experiences with Scala.

Here is a little something I came up with a few hours ago. Our software has some limited interoperability with a SQL database and requires a light abstraction. We chose not to use any 3rd-party ORM or SQL abstraction, mostly because the dependency on these abstractions doesn’t really provide any benefit for our limited use of SQL. So I developed a simple SQL variant abstraction layer, which allows us to execute SQL queries defined in a SQLVariant implementation. Moving from one database to another just requires implementing the SQLVariant interface to provide the proper abstraction. I initially wrote this in Java and, although it was decent, it required quite a bit more code and didn’t look as concise as I wanted. One issue was PreparedStatement and its interface for placeholder bindings. How would one bind Java’s primitive and wrapper types as placeholders, and how would the SQLVariant know which PreparedStatement setter method to call? I resorted to using an enumeration which defines these operations, and reflection for invoking them. I’m basically sidestepping static typing in a place where I’m not sure I really want or have to. Here is the Java implementation.

I got rid of a few methods, specifically dealing with resultset, statement, and connection cleanup, as they don’t really emphasize my point here.

  import java.lang.reflect.Method;
  import java.sql.*;
  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;

  public abstract class SqlVariant {

    public abstract SqlSelectStatement getResultsNotYetNotifiedForStatement(NotificationType... types);

    public abstract SqlSelectStatement getResultsNotYetNotifiedForStatement(int limit, NotificationType... types);

    public abstract SqlUpdateStatement getUpdateWithNotificationsForStatement(Result result);

    private abstract static class SqlStatement<T> {

      protected String sql;
      protected List<BindParameter> bindParams = new ArrayList<BindParameter>();
      protected PreparedStatement stmt;

      public SqlStatement(String sql) {
        this.sql = sql;
      }

      public SqlStatement addBindParam(BindParameter param) {
        bindParams.add(param);
        return this;
      }

      public String getSql() {
        return sql;
      }

      public List<BindParameter> getBindParams() {
        return Collections.unmodifiableList(bindParams);
      }

      protected PreparedStatement prepareStatement(Connection conn) throws SQLException {
        PreparedStatement stmt = conn.prepareStatement(sql);
        for (int bindIdx = 0; bindIdx < bindParams.size(); bindIdx++) {
          BindParameter p = bindParams.get(bindIdx);
          try {
            Method m = stmt.getClass().getMethod(p.type.method, Integer.TYPE, p.type.clazz);
            m.invoke(stmt, bindIdx + 1, p.value);
          }
          catch (Exception e) {
            throw new RuntimeException("Couldn't execute method: " + p.type.method + " on " + stmt.getClass(), e);
          }
        }
        return stmt;
      }

      public abstract T execute(Connection conn) throws SQLException;
    }

    public static final class SqlSelectStatement extends SqlStatement<ResultSet> {

      public SqlSelectStatement(String sql) {
        super(sql);
      }

      @Override
      public ResultSet execute(Connection conn) throws SQLException {
        return prepareStatement(conn).executeQuery();
      }
    }

    public static final class SqlUpdateStatement extends SqlStatement<Boolean> {
      public SqlUpdateStatement(String sql) {
        super(sql);
      }

      @Override
      public Boolean execute(Connection conn) throws SQLException {
        stmt = prepareStatement(conn);
        return stmt.execute();
      }
    }


    public static final class BindParameter<T> {
      private final BindParameterType type;
      private final T value;

      public BindParameter(Class<T> type, T value) {
        this.type = BindParameterType.getTypeFor(type);
        this.value = value;
      }

      public BindParameter(BindParameterType type, T value) {
        this.type = type;
        this.value = value;
      }
    }

    private static enum BindParameterType {
      STRING(String.class, "setString"),
      INT(Integer.TYPE, "setInt"),
      LONG(Long.TYPE, "setLong");

      private Class clazz;
      private String method;

      private BindParameterType(Class clazz, String method) {
        this.clazz = clazz;
        this.method = method;
      }

      private static BindParameterType getTypeFor(Class clazz) {
        for (BindParameterType t : BindParameterType.values()) {
          if (t.clazz.equals(clazz)) {
            return t;
          }
        }
        throw new IllegalArgumentException("Type: " + clazz.getClass() + " is not defined as a BindParameterType enum.");
      }
    }
  }

Now, here is how one would implement the SQLVariant interface. The implementation below is in Groovy. I choose Groovy when I have to do lots of string interpolation, which somehow Java and Scala refuse to support. The code was shortened to demonstrate just the bare minimum.

  class MySqlVariant extends SqlVariant {

    @Override
    public SqlVariant.SqlSelectStatement getResultsNotYetNotifiedForStatement(int limit, NotificationType[] types) {
      SqlVariant.SqlSelectStatement stmt = new SqlVariant.SqlSelectStatement("SELECT ...")
      for (NotificationType t : types)
        stmt.addBindParam(new SqlVariant.BindParameter(String.class, t.name().toUpperCase()))
      return stmt;
    }

    @Override
    public SqlVariant.SqlUpdateStatement getUpdateWithNotificationsForStatement(Result result) {
      SqlVariant.SqlUpdateStatement stmt = new SqlVariant.SqlUpdateStatement("INSERT INTO ....")
      result.notifications?.each { Notification n ->
        stmt.addBindParam(new SqlVariant.BindParameter(SqlVariant.BindParameterType.LONG, n.id))
        stmt.addBindParam(new SqlVariant.BindParameter(SqlVariant.BindParameterType.LONG, result.intervalId))
      }
      return stmt
    }

    ......
  }

I started reimplementing the above in Scala and ran across a very powerful and beautiful feature: implicit conversions. This allowed me to truly abstract the SQLVariant implementations from any binding-specific knowledge, through a conversion facility that normally only dynamically typed languages provide. Scala gives us this ability, but also ensures the static type safety of implicit conversions at compile time.

Another wonderful feature is lazy vals, which let us cleanly implement the lazy evaluation that we (Java programmers) are so used to doing by initializing a member field to null and then checking it before initializing it on the first accessor call. If you’ve seen code like the below a lot, you’ll rejoice to find out that you no longer have to do this in Scala.

public class SomeClass {
  private SomeType type;

  public SomeType getSomeType() {
    if (type == null) type = new SomeType(); // Often more complex than that
    return type;
  }
}

The above, besides not being ideal, is also error prone: if, say, the field is used anywhere else in SomeClass and you don’t use the accessor method to retrieve it, you must either ensure the use of the accessor through convention or deal with the fact that it could be uninitialized. This is no longer the case in Scala, as the language handles lazy initialization for you.
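
Here is the idea in isolation, a minimal sketch using the same hypothetical SomeType:

  class SomeClass {
    // Evaluated once, on first access; no null check or accessor convention required.
    lazy val someType: SomeType = new SomeType()
  }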

Note: I still allow the client data access abstractions to work with a raw JDBC ResultSet returned from the SQLVariant. I don’t see this as an issue at this point, first because these abstractions are SQL-specific and also because ResultSet is a standard interface for any JDBC SQL interaction. Here is my concise Scala implementation. I’m still learning, so this might change as I get more familiar with Scala idioms and start writing more idiomatic Scala code.

  import javax.sql.DataSource
  import java.sql.{ResultSet, Connection, PreparedStatement}
  import com.bazusports.chipreader.sql.SqlVariant.{SqlSelectStatement, BindingValue}

  abstract class SqlVariant(private val ds: DataSource) {

    def retrieveConfigurationStatementFor(eventTag: String): SqlSelectStatement;

    protected final def connection: Connection = ds.getConnection
  }

  object SqlVariant {

    trait BindingValue {def >>(stmt: PreparedStatement, idx: Int): Unit}

    // This is how implicit bindings happen.  This is beauty, we can now
    // bind standard types and have the compiler perform implicit conversions
    implicit final def bindingIntWrapper(v: Int) = new BindingValue {
      def >>(stmt: PreparedStatement, idx: Int) = {stmt.setInt(idx, v)}
    }

    implicit final def bindingLongWrapper(v: Long) = new BindingValue {
      def >>(stmt: PreparedStatement, idx: Int) {stmt.setLong(idx, v)}
    }

    implicit final def bindingStringWrapper(v: String) = new BindingValue {
      def >>(stmt: PreparedStatement, idx: Int) {stmt.setString(idx, v)}
    }

    abstract class SqlStatement[T](conn: Connection, sql: String, params: BindingValue*) {

      // Ah, another beautiful feature, lazy vals.  Basically, it's
      // evaluated on initial call.  This is great for the
      // so common lazy memoization technique, of checking for null.
      protected lazy val statement: PreparedStatement = {
        val stmt:PreparedStatement = conn.prepareStatement(sql)
        params.zipWithIndex.foreach { case (v, idx) => v >> (stmt, idx + 1) }
        stmt
      }

      def execute(): T
    }

    class SqlUpdateStatement(conn: Connection, sql: String, params: BindingValue*)
            extends SqlStatement[Boolean](conn, sql, params: _*) {
      def execute() = statement.execute()
    }

    class SqlSelectStatement(conn: Connection, sql: String, params: BindingValue*)
            extends SqlStatement[ResultSet](conn, sql, params: _*) {
      def execute() = statement.executeQuery()
    }
  }

  /* Implementation of the SQLVariant */

  class MySqlVariant(private val dataSource:DataSource) extends SqlVariant(dataSource) {

    def retrieveConfigurationStatementFor(eventTag: String) =
      new SqlSelectStatement(connection,  "SELECT reader_config FROM event WHERE tag = ?", eventTag)

  }

And the obligatory unit test using the oh-so-awesome Scala Specs framework.

  object MySqlVariantSpec extends Specification {
    val ds = getDataSource();

    "Requesting a configuration statement for a specific event" should {
      "return a SqlSelectStatement with properly bound parameters" in {
        val sqlVariant:SqlVariant = new MySqlVariant(ds)
        val stmt:SqlSelectStatement = sqlVariant.retrieveConfigurationStatementFor("abc")
        stmt must notBeNull
        // .... Other assertions go here
      }
    }
  }

Although I barely scraped the tip of the iceberg, I hope this helps you see some of what Scala has to offer. More to come as I progress.


05
Jan 10

Annoying javax.net.ssl.SSLHandshakeException exception

This exception has to be the most annoying one I’ve faced over the years with Java. I’m not sure which morons wrote the SSL library, but did they think about providing an option to disable SSL certificate validation? I wasn’t aware it was a requirement to have a valid certificate. I mean sure, it’s nice and provides that warm fuzzy security feeling, but when I’m developing and/or testing, can you please provide some way to disable this annoying thing? Either way, I dug into this today and figured it out. It’s, as with anything else in the standard JDK, 100+ lines of code for something they could have provided out of the box as a simple boolean switch; instead you have to implement factories, interfaces, etc… WTF? Just to turn off certificate validation? Talk about over-engineering stuff.

So here is the code, which you can copy and paste into your project, instructions on how to use it are below…

import org.apache.commons.httpclient.protocol.Protocol;
import org.apache.commons.httpclient.protocol.ProtocolSocketFactory;

import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketAddress;
import java.net.UnknownHostException;

import javax.net.SocketFactory;

import org.apache.commons.httpclient.ConnectTimeoutException;
import org.apache.commons.httpclient.HttpClientError;
import org.apache.commons.httpclient.params.HttpConnectionParams;
import org.apache.commons.httpclient.protocol.SecureProtocolSocketFactory;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class TrustAllSSLProtocolSocketFactory implements ProtocolSocketFactory {

    public static void initialize() {
        Protocol.registerProtocol("https", new Protocol("https", new TrustAllSSLProtocolSocketFactory(), 443));
    }

    private SSLContext sslcontext = null;

    private static TrustManager trustAllCerts =
            new X509TrustManager() {
                public java.security.cert.X509Certificate[] getAcceptedIssuers() { return null; }
                public void checkClientTrusted( java.security.cert.X509Certificate[] certs, String authType) {}
                public void checkServerTrusted(java.security.cert.X509Certificate[] certs, String authType) {}
            };

    /**
     * Constructor for TrustAllSSLProtocolSocketFactory.
     */
    private TrustAllSSLProtocolSocketFactory() {
        super();
    }

    private static SSLContext createSSLContext() {
        try {
            SSLContext context = SSLContext.getInstance("SSL");
            context.init(null, new TrustManager[]{trustAllCerts}, null);
            return context;
        } catch (Exception e) {
            throw new HttpClientError(e.toString());
        }
    }

    private SSLContext getSSLContext() {
        if (this.sslcontext == null) {
            this.sslcontext = createSSLContext();
        }
        return this.sslcontext;
    }

    public Socket createSocket(String host, int port, InetAddress clientHost, int clientPort)
            throws IOException, UnknownHostException {
        return getSSLContext().getSocketFactory().createSocket(host, port, clientHost, clientPort);
    }


    public Socket createSocket(final String host, final int port, final InetAddress localAddress,
                               final int localPort, final HttpConnectionParams params
    ) throws IOException, UnknownHostException, ConnectTimeoutException {
        if (params == null) throw new IllegalArgumentException("Parameters may not be null");
        int timeout = params.getConnectionTimeout();
        SocketFactory socketfactory = getSSLContext().getSocketFactory();
        if (timeout == 0) return socketfactory.createSocket(host, port, localAddress, localPort);
        else {
            Socket socket = socketfactory.createSocket();
            SocketAddress localaddr = new InetSocketAddress(localAddress, localPort);
            SocketAddress remoteaddr = new InetSocketAddress(host, port);
            socket.bind(localaddr);
            socket.connect(remoteaddr, timeout);
            return socket;
        }
    }

    public Socket createSocket(String host, int port) throws IOException, UnknownHostException {
        return getSSLContext().getSocketFactory().createSocket(host, port);
    }

    public Socket createSocket(Socket socket, String host, int port, boolean autoClose)
            throws IOException, UnknownHostException {
        return getSSLContext().getSocketFactory().createSocket(socket, host, port, autoClose);
    }

    public boolean equals(Object obj) {
        return ((obj != null) && obj.getClass().equals(TrustAllSSLProtocolSocketFactory.class));
    }

    public int hashCode() {
        return TrustAllSSLProtocolSocketFactory.class.hashCode();
    }
}

Now all you have to do is call TrustAllSSLProtocolSocketFactory.initialize() anywhere in your application initialization code, or right before you access any https resources through Commons HttpClient (or any library built on top of it), and certificate validation is effectively disabled.
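
For example, here is a hypothetical snippet using Commons HttpClient 3.x (shown in Scala to match the rest of this blog’s examples; the URL is made up):

import org.apache.commons.httpclient.HttpClient
import org.apache.commons.httpclient.methods.GetMethod

TrustAllSSLProtocolSocketFactory.initialize()   // register the trust-all https protocol once

val client = new HttpClient()
val get = new GetMethod("https://self-signed.example.com/")
val status = client.executeMethod(get)          // no more SSLHandshakeException for untrusted certs
println(status + ": " + get.getResponseBodyAsString)
get.releaseConnection()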

Hope this helps, though it’s still a pretty ugly hack IMO.
