Preface
One of Apache Spark’s main goals is to make big data applications easier to write. Spark has always had concise APIs in Scala and Python, but its Java API was verbose due to the lack of function expressions. With the addition of lambda expressions in Java 8, we’ve updated Spark’s API to transparently support these expressions, while staying compatible with old versions of Java. This new support will be available in Apache Spark 1.0.
A Few Examples
The following examples show how Java 8 makes code more concise. In our first example, we search a log file for lines that contain “error”, using Spark’s filter and count operations. The code is simple to write, but passing a Function object to filter is clunky:
Java 7 search example:
- JavaRDD<String> lines = sc.textFile("hdfs://log.txt").filter(
-   new Function<String, Boolean>() {
-     public Boolean call(String s) {
-       return s.contains("error");
-     }
-   });
- long numErrors = lines.count();
With Java 8, we can replace the Function object with an inline function expression, making the code a lot cleaner:
Java 8 search example:
- JavaRDD<String> lines = sc.textFile("hdfs://log.txt")
-   .filter(s -> s.contains("error"));
- long numErrors = lines.count();
Java 7 word count:
- JavaRDD<String> lines = sc.textFile("hdfs://log.txt");
- // Map each line to multiple words
- JavaRDD<String> words = lines.flatMap(
-   new FlatMapFunction<String, String>() {
-     public Iterable<String> call(String line) {
-       return Arrays.asList(line.split(" "));
-     }
-   });
- // Turn the words into (word, 1) pairs
- JavaPairRDD<String, Integer> ones = words.mapToPair(
-   new PairFunction<String, String, Integer>() {
-     public Tuple2<String, Integer> call(String w) {
-       return new Tuple2<String, Integer>(w, 1);
-     }
-   });
- // Group up and add the pairs by key to produce counts
- JavaPairRDD<String, Integer> counts = ones.reduceByKey(
-   new Function2<Integer, Integer, Integer>() {
-     public Integer call(Integer i1, Integer i2) {
-       return i1 + i2;
-     }
-   });
- counts.saveAsTextFile("hdfs://counts.txt");
Java 8 word count:
- JavaRDD<String> lines = sc.textFile("hdfs://log.txt");
- JavaRDD<String> words =
-   lines.flatMap(line -> Arrays.asList(line.split(" ")));
- JavaPairRDD<String, Integer> counts =
-   words.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
-     .reduceByKey((x, y) -> x + y);
- counts.saveAsTextFile("hdfs://counts.txt");
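If you want to experiment with the same lambda shapes without a cluster, here is a rough local analogue. It is only a sketch: it uses the JDK's `java.util.stream` API rather than Spark, the class name `LocalWordCountSketch` and the sample input are made up for illustration, and `Collectors.toMap` with a merge function stands in for the `mapToPair` plus `reduceByKey` steps.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical class name; this is plain Java 8, not Spark.
class LocalWordCountSketch {

    // Counts words in an in-memory list of lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
            // Split each line into words (plays the role of flatMap above)
            .flatMap(line -> Arrays.stream(line.split(" ")))
            // Map each word to 1 and sum values that share a key
            // (plays the role of mapToPair followed by reduceByKey)
            .collect(Collectors.toMap(w -> w, w -> 1, (x, y) -> x + y));
    }

    public static void main(String[] args) {
        // Made-up sample input
        List<String> lines = Arrays.asList("a b a", "b c");
        System.out.println(wordCount(lines));
    }
}
```

The `(x, y) -> x + y` lambda is the same one passed to `reduceByKey` in the Spark version; only the surrounding API differs.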
We are very excited to offer this functionality, as it opens up the simple, concise programming style that Scala and Python Spark users are familiar with to a much broader set of developers.
Availability
Java 8 lambda support will be available in Apache Spark 1.0, which will be released in early May. Although using this syntax requires Java 8, Apache Spark 1.0 will still support older versions of Java through the old form of the API. As far as the API is concerned, a lambda expression is simply shorthand for an anonymous inner class that implements a single-method interface, so the same API can be used in any Java version.
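To see why one API can serve both styles, consider the minimal sketch below. It does not use Spark: the nested `Function` interface, the `filter` helper, and the sample lines are hypothetical stand-ins, shaped like the `Function` object in the search example above. The point is only that a method accepting a single-method interface can be called with either an anonymous inner class or a lambda.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical class name; illustrates the idea without Spark.
class LambdaCompatSketch {

    // Hypothetical single-method interface, shaped like the Function used above.
    interface Function<T, R> {
        R call(T t);
    }

    // Hypothetical stand-in for an API method that accepts a Function.
    static List<String> filter(List<String> input, Function<String, Boolean> predicate) {
        List<String> kept = new ArrayList<>();
        for (String s : input) {
            if (predicate.call(s)) {
                kept.add(s);
            }
        }
        return kept;
    }

    // Made-up sample data
    static List<String> sampleLines() {
        return Arrays.asList("error: disk full", "ok", "another error");
    }

    // Older style: anonymous inner class
    static long countErrorsAnonymous(List<String> lines) {
        return filter(lines, new Function<String, Boolean>() {
            @Override
            public Boolean call(String s) {
                return s.contains("error");
            }
        }).size();
    }

    // Java 8 style: a lambda passed to the very same method
    static long countErrorsLambda(List<String> lines) {
        return filter(lines, s -> s.contains("error")).size();
    }

    public static void main(String[] args) {
        List<String> lines = sampleLines();
        System.out.println(countErrorsAnonymous(lines)); // prints 2
        System.out.println(countErrorsLambda(lines));    // prints 2
    }
}
```

Both calls go through the same `filter` signature, so code written against the older form of an API like this keeps working while Java 8 users get the shorter syntax.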