Building Druid for Cloudera 5.4.x

Posted by Kaya Kupferschmidt • Monday, November 30. 2015 • Category: Java

So the other day I wanted to evaluate Druid as a reporting backend database. Unfortunately, Druid doesn't work out of the box with Cloudera 5.4: I always get an error when running the Hadoop indexer, whether via the CLI or via the indexing service. The exceptions in Hadoop always look like this:

2015-11-30 11:42:37,653 ERROR [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.VerifyError: class com.fasterxml.jackson.datatype.guava.deser.HostAndPortDeserializer overrides final method deserialize.(Lcom/fasterxml/jackson/core/JsonParser;Lcom/fasterxml/jackson/databind/DeserializationContext;)Ljava/lang/Object;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    ...

So the problem seems to be a classic version mismatch between Cloudera Hadoop and Druid. Specifically, the two projects use incompatible versions of the Jackson libraries (Cloudera still ships 2.2.3, while Druid uses 2.4.6). After some trials with different Jackson versions, I got it to work by modifying Druid's own dependencies and building it myself. Since I suspect that others may run into similar problems, here is what I did to get Druid up and running:

git clone https://github.com/druid-io/druid.git
cd druid
git checkout 0.8.2
sed -i "s#jackson.version>2.4.6<#jackson.version>2.3.5<#" pom.xml
mvn package -DskipTests
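If you want a quick sanity check that the sed pattern actually matches before kicking off a long Maven build, you can try the same substitution on a minimal fragment first. The path /tmp/pom-test.xml is just a scratch file for illustration, not part of the Druid build:

```shell
# write a minimal fragment containing the property as it appears in Druid's pom.xml
printf '<jackson.version>2.4.6</jackson.version>\n' > /tmp/pom-test.xml

# apply the same substitution used above
sed -i "s#jackson.version>2.4.6<#jackson.version>2.3.5<#" /tmp/pom-test.xml

# the fragment should now contain 2.3.5
cat /tmp/pom-test.xml
```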

After that you will find a packaged version of Druid at

distribution/target/druid-0.8.3-SNAPSHOT-bin.tar.gz

which should work with Cloudera 5.4. 

Setting up an Apache Cluster with Vagrant

Posted by Kaya Kupferschmidt • Wednesday, February 4. 2015 • Category: Java

Vagrant makes the perfect companion for developers who need to simulate complex cluster setups on a single machine. This is especially true when using vagrant-lxc as the container provider, which uses Linux containers instead of full virtualisation.

Directory Structure

With the following ingredients you can set up a whole Apache Storm cluster. You can download the whole package on GitHub, but let us look at the details. You will need the following directory structure:

+ Vagrantfile
|
+----- provision
         |
         +------ data
         |        + hosts
         |
         +------ puppet
         |        |
         |        +------ manifests
         |        |        + site.pp
         |        |
         |        +------ modules
         |        + Puppetfile
         |
         +------ scripts
                  + main.sh
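The skeleton above can be created in one go. This is just a convenience sketch; the names match the tree shown, and the mktemp line only keeps the experiment in a scratch directory:

```shell
# work in a scratch directory so we don't clutter anything (drop this line for a real project)
cd "$(mktemp -d)"

# create the directory tree for the Vagrant/Puppet setup
mkdir -p provision/data provision/puppet/manifests provision/puppet/modules provision/scripts

# create the (initially empty) files from the tree above
touch Vagrantfile provision/data/hosts provision/puppet/manifests/site.pp \
      provision/puppet/Puppetfile provision/scripts/main.sh

# list what we just created
find . -type f | sort
```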

Continue reading "Setting up an Apache Cluster with Vagrant"

Vagrant with LXC

Posted by Kaya Kupferschmidt • Wednesday, February 4. 2015 • Category: Programming

Nowadays working with virtual machines is almost a requirement for a software developer, especially when working on a web-based application. The idea is that the virtual machine provides a clean environment for the application, ideally reflecting the final production environment. But true virtual machines (like VirtualBox, VMware or KVM) come at a high cost in terms of resource usage (disk space, performance penalties, memory usage). This is especially true if you need to maintain a lot of different virtual machines, either for different projects or for a virtual cluster. If you are working on Linux (and as a developer you really should, unless you depend on .NET), the situation gets much better if you don't rely on full virtualisation but use Linux Containers instead.

Linux Containers (LXC)

Linux Containers (LXC) are a chroot-ed environment on steroids. In more detail: a Linux container provides an isolated execution environment with a separate root directory (like chroot), possibly some cgroup settings (though I am no expert in this area), a dedicated virtual network interface, and so on. But the big difference to full virtualisation is that a container still runs on the same kernel as the host OS. This means that there is no performance penalty due to virtualisation, and you can even use a subdirectory on your normal hard disk as the new root directory for the Linux container. This implies that the contents of the Linux container are directly accessible from the host, while the container can only access that specific subdirectory (plus possibly some other explicitly mounted directories). More than that, because there is only one kernel instance running, all containers share the same page cache, which greatly increases caching efficiency. I don't want to dig any deeper into LXC at this point, but I invite you to investigate LXC with your favorite search engine.
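One simple way to see the shared-kernel property for yourself, assuming a Linux host with LXC set up: `uname -r` reports the same kernel release inside a container as on the host, because there is no guest kernel at all. The container name below is just an example:

```shell
# on the host: print the kernel release
uname -r

# inside a running container (requires a container named "mycontainer" to exist;
# this prints the exact same kernel release as the host)
# sudo lxc-attach -n mycontainer -- uname -r
```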

Vagrant

Vagrant is a great tool for providing virtual environments for developers, including automatic provisioning and configuration. At first glance, Vagrant looks very similar to Docker, but for me Vagrant is much more powerful, while Docker seems to excel at deploying comparatively simple applications (like MySQL). But once you try to set up a virtual Hadoop cluster (where the nodes need to access a lot of network ports on other nodes), Docker doesn't look like the right tool for me. Such situations are where Vagrant really shines.

By default Vagrant uses VirtualBox as the provider for the environments, but luckily Vagrant offers a plugin API for implementing other providers (for example AWS, KVM, libvirt, ...). For me, LXC seems to be the natural choice on Linux, and although LXC is not supported out of the box by Vagrant, Fabio Rehm has put great effort into implementing a corresponding provider.

Continue reading "Vagrant with LXC"

Apache Spark Logging

Posted by Kaya Kupferschmidt • Saturday, December 13. 2014 • Category: Programming
I just began learning about Apache Spark, a great tool for Big Data processing. But when I start the spark-shell, I get lots and lots of logging output, which is really annoying to me:
kaya@dvorak:/opt/kaya$ spark-shell 
2014-12-13 17:59:59,652 INFO  [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing view acls to: kaya
2014-12-13 17:59:59,657 INFO  [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing modify acls to: kaya
2014-12-13 17:59:59,657 INFO  [main] spark.SecurityManager (Logging.scala:logInfo(59)) - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(kaya); users with modify permissions: Set(kaya)
2014-12-13 17:59:59,658 INFO  [main] spark.HttpServer (Logging.scala:logInfo(59)) - Starting HTTP Server
2014-12-13 17:59:59,712 INFO  [main] server.Server (Server.java:doStart(272)) - jetty-8.y.z-SNAPSHOT
2014-12-13 17:59:59,736 INFO  [main] server.AbstractConnector (AbstractConnector.java:doStart(338)) - Started SocketConnector@0.0.0.0:41602
2014-12-13 17:59:59,736 INFO  [main] util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'HTTP class server' on port 41602.
Welcome to
      __              __
     / _/_  _ ___/ /__
    \ \/  \/  `/ _/  '_/
   /_/ ._/_,// //_\   version 1.1.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.
2014-12-13 18:00:03,792 INFO  [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing view acls to: kaya
2014-12-13 18:00:03,793 INFO  [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing modify acls to: kaya
2014-12-13 18:00:03,793 INFO  [main] spark.SecurityManager (Logging.scala:logInfo(59)) - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(kaya); users with modify permissions: Set(kaya)
2014-12-13 18:00:04,193 INFO  [sparkDriver-akka.actor.default-dispatcher-2] slf4j.Slf4jLogger (Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started
2014-12-13 18:00:04,229 INFO  [sparkDriver-akka.actor.default-dispatcher-2] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting
2014-12-13 18:00:04,416 INFO  [sparkDriver-akka.actor.default-dispatcher-3] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting started; listening on addresses :[akka.tcp://sparkDriver@dvorak.ffm.dimajix.net:54519]
2014-12-13 18:00:04,418 INFO  [sparkDriver-akka.actor.default-dispatcher-4] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting now listens on addresses: [akka.tcp://sparkDriver@dvorak.ffm.dimajix.net:54519]
2014-12-13 18:00:04,425 INFO  [main] util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'sparkDriver' on port 54519.
2014-12-13 18:00:04,439 INFO  [main] spark.SparkEnv (Logging.scala:logInfo(59)) - Registering MapOutputTracker
[...]
scala> 
I searched the net for a hint on how to get rid of all those INFO messages, but most of the advice didn't quite work. Finally I found a way to calm down the output of spark-shell for the current user: create a file called log4j.properties (or any other name) and store it in a convenient location. I put mine into my Linux home directory at /home/kaya/log4j.properties. The file should contain the following:
# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
That was the easy part. Now you need to tell spark-shell to actually use this file for its logging configuration, which is done by setting the environment variable SPARK_SUBMIT_OPTS to -Dlog4j.configuration=file:/home/kaya/log4j.properties. In bash, for example:
export SPARK_SUBMIT_OPTS=-Dlog4j.configuration=file:/home/kaya/log4j.properties
I simply added this line to my .bash_profile file, so that the environment variable gets set every time I log into my computer. Now spark-shell starts as follows:
kaya@dvorak:/opt/kaya$ spark-shell 
Welcome to
      __              __
     / _/_  _ ___/ /__
    \ \/  \/  `/ _/  '_/
   /_/ ._/_,// //_\   version 1.1.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.
14/12/13 18:09:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as sc.
Much better. Plus I now see an important warning which I need to address...

OpenGL - Web 2.0 Style

Posted by Kaya Kupferschmidt • Monday, December 3. 2007 • Category: OpenGL
Today I stumbled over a new-to-me project on OpenGL.org: an OpenGL canvas plugin for Mozilla Firefox! The plugin, called Canvas 3D, only works in the new Firefox 3 line and provides an OpenGL context to JavaScript.

In my eyes this opens the doors to many exciting new Web applications (if JavaScript turns out to be fast enough), ranging from simple model viewers to advanced online editors. I wish I could find some time to explore some of the possibilities!

I HAS 1337 CODE. LOL!!1

Posted by Kaya Kupferschmidt • Wednesday, June 27. 2007 • Category: Programming

C#, Java and C++ are out. The upcoming star is called LOLCODE, a new programming language that imitates the natural language of the leet coders.

HAI
CAN HAS STDIO?
PLZ OPEN FILE "LOLCATS.TXT"?
   AWSUM THX
     VISIBLE FILE
   O NOES
     INVISIBLE "ERROR!"
KTHXBYE

Obviously this new programming language will start a new era in software development!
