Author Archives: helena

Yet Another Distributed Systems Reading List

I started putting together a reading list for engineers fairly new to distributed systems and theory, with little hands-on experience yet. Having worked solely with/on teams of all senior engineers, I am now running a team with very senior but also one or two relatively recent grads.  In compiling this list to cherry pick from, with input from many twitter friends in response to my plea for input, I realized there’s some I’ve not yet read so this is super useful, many thanks to people that replied.

Some pages make this a list of lists or post of posts :) Ping me on twitter if you have more gems to add https://twitter.com/helenaedelson. I hope to make some sense of this at some point, like a ‘start here’ chunk of readings, progressing to the others.

In No Order:

Share/Save

Comparing and Choosing Technologies

Unable to sleep at 2:30 AM, I just read an interesting case study with a large company on how they have started to address Ad Tech applied to TV households vs mobile and web, with frequency capping to viewers, etc. Not surprisingly they use Kafka. It made me think of a situation I was in where someone brought up Solace versus Kafka.

When I listen to people that talk about any technology, I try to really listen to

  • Whether they are speaking from a place of personal experience or from what they have read and surmised from blogs by users that may or may not know what they are doing with that technology, may or may not have applied it properly, and so forth
  • If they have merely played around with that technology or do they have actual production experience
  • And if they have prod experience with that technology was it actually at scale, under real world load

Or, are they talking about a solution where required features are not yet completed – i.e. currently vaporware in a sense. These sort of situations must be taken with care, as opposed to proven ones.

For example, trying to compare Solace to Kafka is comparing apples to oranges. I have seen this sort of comparison attempted many times over the last decade that I have been in messaging, and it (a pun) is fruitless :)

They both serve a different purpose in an architecture, and while they overlap in some functionalities, Solace seems to fit more particular use cases – and handle those use cases very well. While some features are both out of the box with Kafka, and proven in production – at extremely high load and at scale.  No technology is a one size fits all solution. For example I know a production system that uses both an Aerospike cluster to solve some problems and a Cassandra cluster to solve others – both NoSQL stores serving different applications in different areas of the architecture. Taking this a step further, a deployment architecture for one technology, such as Cassandra, would have multiple clusters, one cluster for analytics, one cluster for timeseries, and so forth.

The concept I want to highlight is event-driven architecture, nothing new, and I’ve been working in this for years, but it is quite different from more conventional ETL data flows. Leveraging this architecture can both decouple cleanly and improve speed if done right. In ad tech if you do not respond in time, or correctly, you lose the bid and thus lose money. But this is only one part of the system. It’s not the high-frequency trading use case but it shares some similar requirements.

If you are promised features or an MVP by a team you may be a customer for, including being an internal customer, consider whether that team is motivated enough to have your team/company as a customer, and willing as well as able to do the work quickly to support it. For example, I would dig further into this with their product manager, project manager and engineering lead to find out if they have the a) roadmap, b) priorities, c) resources d) QA, test automation and perf engineers to do this. I wouldn’t want to sign up for it only to find they can’t deliver, or can’t deliver in the necessary time frame.

Things like hadoop or terradata are very nice but you can’t scale with them, and they are not applicable for streaming, event-driven architectures and workloads. You can however scale with Kafka and leverage Kafka to handle the ingestion stream to run analytics and respond back in-line, for example a system run on Kafka that can do distributed RPC, leveraging Akka (or Storm but akka is easier). You can do a lot in 20 ms. For example you can write a system where the consumer sees the produced data in 2 ms tops. A response via Akka remoting for example is another ms hop back. 1 ms response time. There are many ways to solve this of course, and here again, you can use Kafka – Storm, you can use Kafka – Akka and feed to Spark if you can do micro-batching and do not need very low latency analytics response. It all depends on your requirements – what are you solving for, your competition and your constraints.

 

Share/Save

Changes

It seems the word is out that I left DataStax. I made so many friends there, even recruited an old friend, and know many of us will continue to cross paths and collaborate. Thanks to everyone for the great memories.

Over the last year I have spoken at several Spark, Scala and Machine Learning meetups, then internationally at Scala Conferences like Scala Days Amsterdam, many Big Data conferences such as Spark Summit, Strata and Emerging Tech Conferences, all of which I hope to continue to do.

Spark Summit East 2015

Scala Days Amsterdam, 2015



Philly ETE – Emerging Tech 2015

In October 2015 I’ll be speaking at Spark Summit EU, and on the Spark panel at Grace Hopper. Also I’ll be at Strata NYC again, this year hosting the Spark track on Thursday. I’m adding some cool stuff to my Big Data Lambda talk and new code.

What’s next – I’m humbled to have been contacted by many engineers I know and highly respect about working at their companies, where in some cases they are the founders. This is making for a very difficult decision because they are all doing really interesting and exciting work with open source technologies I like. In the meantime I am taking a month or so to enjoy the last part of summer, something I haven’t had time to do for years :)

Share/Save

Spark Streaming with Kafka and Cassandra

This is a lightning post on a future, more in-depth post on how to integrate Spark Streaming, Kafka and Cassandra, using the Spark Cassandra Connector.

First, here are some of the dependencies used below:

1
2
3
4
5
6
7
8
9
10
11
import com.datastax.spark.connector.streaming._
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkEnv, SparkConf, Logging}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
import com.datastax.spark.connector.demo.streaming.embedded._
import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.demo.Assertions

Start by configuring Spark:

1
2
3
4
5
  val sc = new SparkConf(true)
    .set("spark.cassandra.connection.host", SparkMaster)
    .set("spark.cleaner.ttl", "3600")
    .setMaster("local[12]")
    .setAppName("Streaming Kafka App")

Where ‘SparkMaster’ can be a comma-separated list of Cassandra seed nodes.

Using the SparkConf object you can easily configure as well as populate Cassandra with a keyspace, table and data for quick prototyping. This creates the keyspace and table in Cassandra:

1
2
3
4
5
  CassandraConnector(sc).withSessionDo { session =>
    session.execute(s"CREATE KEYSPACE IF NOT EXISTS streaming_test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3 }")
    session.execute(s"CREATE TABLE IF NOT EXISTS streaming_test.key_value (key VARCHAR PRIMARY KEY, value INT)")
    session.execute(s"TRUNCATE streaming_test.key_value")
  }

Custom embedded class which starts an (embedded local Zookeeper server and an (embedded Kafka server and client (all very basic here):

1
val kafka = new EmbeddedKafka

Create the Spark Streaming context.

1
val ssc =  new StreamingContext(sc, Seconds(2))

One of several ways to insure shutdown of ZooKeeper and Kafka for this basic prototype:

1
SparkEnv.get.actorSystem.registerOnTermination(kafka.shutdown())

Assuming you already have or are creating a new topic in Kafka and are sending messages to Kafka…. but here is a very basic way to do it if not:

1
2
3
4
5
6
7
8
9
10
11
12
def createTopic(topic: String): Unit = 
    CreateTopicCommand.createTopic(client, topic, 1, 1, "0")
 
def produceAndSendMessage(topic: String, sent: Map[String, Int]) {
    val p = new Properties()
    p.put("metadata.broker.list", kafkaConfig.hostName + ":" + kafkaConfig.port)
    p.put("serializer.class", classOf[StringEncoder].getName)
 
    val producer = new Producer[String, String](new ProducerConfig(p))
    producer.send(createTestMessage(topic, sent): _*)
    producer.close()
  }

Where the createTestMessage function just for these purposes generates test data.

Creates an input stream that pulls messages from a Kafka Broker.

1
2
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)

Import the implicits enabling the DStream’s ‘saveToCassandra’ functions:

1
import com.datastax.spark.connector.streaming._

This defines the work to do in the stream. Configure the stream computations (here we do a very rudimentary computation) and write the DStream output to Cassandra:

1
2
3
4
5
6
stream.map { case (_, v) => v }
      .map(x => (x, 1))
      .reduceByKey(_ + _)
      .saveToCassandra("streaming_test", "key_value", SomeColumns("key", "value"), 1)
 
    ssc.start()

Now we can query our Cassandra keyspace and table to see if the expected data has been persisted:

1
2
3
4
5
 import scala.concurrent.duration._
val rdd = ssc.cassandraTable("streaming_test", "key_value").select("key", "value")
    awaitCond(rdd.collect.size == sent.size, 5.seconds)
    val rows = rdd.collect
    sent.forall { rows.contains(_)}

And finally, our assertions being successful, we can explicitly do a graceful shutdown of the Spark Streaming context with:

1
ssc.stop(stopSparkContext = true, stopGracefully = true)

Share/Save