Building Druid for Cloudera 5.4.x

Posted by Kaya Kupferschmidt • Monday, November 30. 2015 • Category: Java

So the other day I wanted to investigate using Druid as a reporting backend database. Unfortunately, Druid doesn't work out of the box with Cloudera 5.4: I always get an error when running the Hadoop indexer, whether via the CLI or via the indexing service. The exceptions in Hadoop always look like this:

2015-11-30 11:42:37,653 ERROR [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.VerifyError: class com.fasterxml.jackson.datatype.guava.deser.HostAndPortDeserializer overrides final method deserialize.(Lcom/fasterxml/jackson/core/JsonParser;Lcom/fasterxml/jackson/databind/DeserializationContext;)Ljava/lang/Object;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(

So the problem seems to be a classic version mismatch between Cloudera Hadoop and Druid. Specifically, the two projects use incompatible versions of the Jackson libraries (Cloudera still ships 2.2.3, while Druid uses 2.4.6). After some trials with different Jackson versions, I got it working by modifying the dependencies of Druid itself and building it myself. Since I suspect that others may run into similar problems, here is what I did to get Druid up and running:

git clone
cd druid
git checkout 0.8.2
sed -i "s#jackson.version>2.4.6<#jackson.version>2.3.5<#" pom.xml
mvn package -DskipTests

After that you will find a packaged version of Druid at


which should work with Cloudera 5.4. 
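
If you want to double-check that the sed expression actually rewrites the Jackson version property before kicking off the long Maven build, you can try the substitution on a sample line first (a hypothetical one-liner, not part of the build itself):

```shell
# The same substitution as in the build steps, applied to a sample pom.xml line:
line='<jackson.version>2.4.6</jackson.version>'
echo "$line" | sed "s#jackson.version>2.4.6<#jackson.version>2.3.5<#"
# prints: <jackson.version>2.3.5</jackson.version>
```

After running the real sed command you can also grep pom.xml for jackson.version to confirm the property was changed.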

Setting up an Apache Cluster with Vagrant

Posted by Kaya Kupferschmidt • Wednesday, February 4. 2015 • Category: Java

Vagrant makes the perfect companion for developers who need to simulate complex cluster setups on a single machine. This is especially true when using vagrant-lxc as the container provider, which uses Linux containers instead of full virtualisation.

Directory Structure

With the following ingredients you can set up a whole Apache Storm cluster. You can download the whole package on GitHub. But let us look at the details. You will need the following directory structure:

+ Vagrantfile
+----- provision
         +------ data
         |        + hosts
         +------ puppet
         |        |
         |        +------ manifests
         |        |        + site.pp
         |        |
         |        +------ modules
         |        + Puppetfile
         +------ scripts
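
If you want to follow along, the skeleton can be created with two shell commands (the paths are exactly those from the tree above; site.pp, Puppetfile and the other files start out empty):

```shell
# Create the directory skeleton for the Vagrant/Puppet setup shown above:
mkdir -p provision/data provision/puppet/manifests provision/puppet/modules provision/scripts
touch Vagrantfile provision/data/hosts provision/puppet/manifests/site.pp provision/puppet/Puppetfile
```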

Continue reading "Setting up an Apache Cluster with Vagrant"

Vagrant with LXC

Posted by Kaya Kupferschmidt • Wednesday, February 4. 2015 • Category: Programming

Nowadays working with virtual machines is almost a requirement for a software developer, especially when working on a web-based application. The idea is that the virtual machine provides a clean environment for the application, ideally reflecting the final production environment. But true virtual machines (like VirtualBox, VMware or KVM) come at a high cost in terms of resource usage (disk space, performance penalties, memory usage). This is especially true if you need to maintain a lot of different virtual machines, either for different projects or for a virtual cluster. If you are working on Linux (and as a developer you really should, unless you depend on .NET), the situation gets much better if you don't rely on full virtualisation but use Linux Containers instead.

Linux Containers (LXC)

Linux Containers (LXC) are a chroot-ed environment on steroids. In more detail: a Linux container provides an isolated execution environment with a separate root directory (like chroot), possibly some cgroup settings (though I am no expert in this area), a dedicated virtual network interface, and so on. The big difference to full virtualisation is that a container still runs on the same kernel as the host OS. This means that there is no performance penalty due to virtualisation, and you can even use a subdirectory on your normal hard disc as the new root directory for the Linux container. This implies that the contents of the container are directly accessible from the host, while the container itself can only access that specific subdirectory (or other explicitly mounted directories). More than that: because there is only one kernel instance running, all containers share the same page cache, which greatly increases caching efficiency. I don't want to dig any deeper into LXC at this point, but of course I invite you to investigate LXC using your favourite search engine.


Vagrant

Vagrant is a great tool for providing virtual environments for developers, including automatic provisioning and configuration. At first glance, Vagrant looks very similar to Docker, but for me Vagrant is much more powerful, while Docker seems to excel at deploying comparatively simple applications (like MySQL). But once you try to set up a virtual Hadoop cluster (where the nodes need to access a lot of network ports on other nodes), Docker doesn't look like the right tool to me. Such situations are where Vagrant really shines.

By default Vagrant uses VirtualBox as the provider for the environments, but luckily Vagrant offers a plugin API for implementing other providers (for example AWS, KVM, libvirt, ...). For me, LXC seems to be the natural choice on Linux, and although LXC is not supported out of the box by Vagrant, Fabio Rehm has put great effort into implementing a corresponding provider.

Continue reading "Vagrant with LXC"

Behaviour of sudo on Ubuntu and Fedora

Posted by Kaya Kupferschmidt • Wednesday, February 4. 2015 • Category: Linux

I often use the Linux command sudo to perform administrative tasks on my machine or to run otherwise restricted programs. But sometimes I observe strange behaviour when the program under sudo control tries to access files in the user's home directory. The location of the home directory of the current user is stored in an environment variable named HOME, so it is interesting to see how this variable is defined under sudo. First let's try Fedora 21:

kaya@fedora:~$ env | grep HOME
HOME=/home/kaya

kaya@fedora:~$ sudo env | grep HOME
HOME=/root

So when I run a program with sudo on Fedora 21, the home directory of the user root will be used. Let us check the user names stored in the environment:

kaya@fedora:~$ env | grep USER
USER=kaya
USERNAME=kaya

kaya@fedora:~$ sudo env | grep USER
USER=root
USERNAME=kaya

So this means that on Fedora 21, the environment variable USER will also change to root, but USERNAME remains unchanged and reflects the original user.

Now let us try the same on Ubuntu 14.04:

kaya@ubuntu:~$ env | grep HOME
HOME=/home/kaya

kaya@ubuntu:~$ sudo env | grep HOME
HOME=/home/kaya

So on Ubuntu 14.04, the home directory remains unchanged under sudo, unlike on Fedora. Let's check the user names:

kaya@ubuntu:~$ env | grep USER
USER=kaya

kaya@ubuntu:~$ sudo env | grep USER
USER=root
USERNAME=root

This is really weird and looks wrong to me: both USER and USERNAME change to root, but HOME keeps pointing at my own home directory. And the behaviour is completely different from Fedora 21.

Default sudo Behaviour on Fedora 21 and Ubuntu 14.04

The following table gives an overview of the default behaviour of sudo.

             env (Fedora)   env (Ubuntu)   sudo env (Fedora)   sudo env (Ubuntu)
USER         kaya           kaya           root                root
USERNAME     kaya           N/A            kaya                root
HOME         /home/kaya     /home/kaya     /root               /home/kaya

The really big problem here is that I mount my home directories via NFS. In this environment, the root user only has restricted access to the users' home directories, which often causes trouble with sudo. Therefore, when I want to execute a command as a different user, I not only want to assume that user's identity, I also want to use his home directory during the sudo operation. This works on Fedora, but does not work on Ubuntu.
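
The reason HOME matters so much is that programs simply trust the environment variable instead of looking up the passwd entry. A tiny demonstration (plain bash, no sudo involved):

```shell
# Programs resolve "~" and per-user config paths from $HOME, whatever it is set to:
HOME=/tmp/some-other-home bash -c 'echo "my home is $HOME"'
# prints: my home is /tmp/some-other-home
```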

Fixing sudo Behaviour

As it turns out, the behaviour can be fixed by adjusting the file /etc/sudoers according to the sudoers manual, or simply by copying the relevant sections from the Fedora installation. (Note that you should use visudo for editing the file, as mentioned in the comments inside the file.) At the beginning of the file /etc/sudoers on Ubuntu, you should add the following defaults, essentially the lines from the Fedora sudoers: always_set_home makes sudo set HOME to the target user's home directory, and env_keep preserves USERNAME:

Defaults    always_set_home
Defaults    env_reset
Defaults    env_keep += "USERNAME"

Apache Spark Logging

Posted by Kaya Kupferschmidt • Saturday, December 13. 2014 • Category: Programming

I just began learning about Apache Spark, a great tool for Big Data processing. But when I start the spark-shell, I get lots and lots of logging output, which is really annoying:
kaya@dvorak:/opt/kaya$ spark-shell 
2014-12-13 17:59:59,652 INFO  [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing view acls to: kaya
2014-12-13 17:59:59,657 INFO  [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing modify acls to: kaya
2014-12-13 17:59:59,657 INFO  [main] spark.SecurityManager (Logging.scala:logInfo(59)) - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(kaya); users with modify permissions: Set(kaya)
2014-12-13 17:59:59,658 INFO  [main] spark.HttpServer (Logging.scala:logInfo(59)) - Starting HTTP Server
2014-12-13 17:59:59,712 INFO  [main] server.Server ( - jetty-8.y.z-SNAPSHOT
2014-12-13 17:59:59,736 INFO  [main] server.AbstractConnector ( - Started SocketConnector@
2014-12-13 17:59:59,736 INFO  [main] util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'HTTP class server' on port 41602.
Welcome to
      __              __
     / _/_  _ ___/ /__
    \ \/  \/  `/ _/  '_/
   /_/ ._/_,// //_\   version 1.1.0

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.
2014-12-13 18:00:03,792 INFO  [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing view acls to: kaya
2014-12-13 18:00:03,793 INFO  [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing modify acls to: kaya
2014-12-13 18:00:03,793 INFO  [main] spark.SecurityManager (Logging.scala:logInfo(59)) - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(kaya); users with modify permissions: Set(kaya)
2014-12-13 18:00:04,193 INFO  [] slf4j.Slf4jLogger (Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started
2014-12-13 18:00:04,229 INFO  [] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting
2014-12-13 18:00:04,416 INFO  [] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting started; listening on addresses :[akka.tcp://]
2014-12-13 18:00:04,418 INFO  [] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting now listens on addresses: [akka.tcp://]
2014-12-13 18:00:04,425 INFO  [main] util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'sparkDriver' on port 54519.
2014-12-13 18:00:04,439 INFO  [main] spark.SparkEnv (Logging.scala:logInfo(59)) - Registering MapOutputTracker

I searched the net for hints on how to get rid of all those INFO messages, but most of the advice didn't quite work. Finally I found a way to calm down the output of spark-shell for the current user. You need to create a file called (or any other name) and store it in a convenient location. I put mine into my Linux home directory as /home/kaya/ The file should contain the following:
# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose$exprTyper=INFO$SparkILoopInterpreter=INFO

That was the easy part. Now you need to tell spark-shell to actually use this file for its logging configuration. This can be done by setting the environment variable SPARK_SUBMIT_OPTS to -Dlog4j.configuration=file:/home/kaya/, in bash for example via
export SPARK_SUBMIT_OPTS=-Dlog4j.configuration=file:/home/kaya/

I simply added this line to my .bash_profile file, so that the environment variable is set every time I log into my computer. And now Spark starts as follows:
kaya@dvorak:/opt/kaya$ spark-shell 
Welcome to
      __              __
     / _/_  _ ___/ /__
    \ \/  \/  `/ _/  '_/
   /_/ ._/_,// //_\   version 1.1.0

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.
14/12/13 18:09:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as sc.

Much better. Plus I now see an important warning which I need to address...

Disable DHCPv6 on AVM Fritzbox

Posted by Kaya Kupferschmidt • Wednesday, January 25. 2012 • Category: Hardware

If you own a FritzBox router from AVM and use IPv6, this might be interesting for you. If IPv6 is enabled, all clients will get an IPv6 DNS server from the router. Although this might seem like a nice feature, it creates problems if you run your own DNS server for your local network. All Windows clients will first ask the IPv6 DNS server announced by the FritzBox, and only then ask other IPv4 DNS servers. This is especially bad if you have configured some hostnames in your own DNS server differently for your local network than for the internet (which makes sense if you run a server in your network that is also accessible from the internet). In such situations you really want to get rid of the DNS server announced by the FritzBox.

Unfortunately this is not possible from the GUI, but you can disable DHCPv6 (which is used for announcing the DNS server) by changing a config file on the FritzBox. You need to do the following:

  1. Enable telnet by dialing #96*7* from a phone connected to the FritzBox.
  2. Log in to your FritzBox with telnet (or whatever address the FritzBox has in your LAN)
  3. # cd /var/flash
  4. # nvi ar7.cfg
  5. Change the setting dhcpv6lanmode to dhcpv6lanmodeoffstateless
  6. Disable telnet by dialing #96*8*
  7. Reboot the FritzBox

This should completely turn off the DHCPv6 server in the FritzBox.
