
java – Setting Mapper and Reducer class in Spark

Posted by: admin, February 25, 2020

Question:

I am trying to convert the following code from Hadoop to Spark.

    Configuration conf = new Configuration();
    Job j = new Job(conf, "Adjacency Generator Job");

    j.setJarByClass(EdgeListToAdjacencyList.class);
    j.setMapOutputKeyClass(LongWritable.class);
    j.setMapOutputValueClass(Text.class);
    j.setOutputKeyClass(IntWritable.class);
    j.setOutputValueClass(Text.class); 
    j.setInputFormatClass(TextInputFormat.class);
    j.setOutputFormatClass(TextOutputFormat.class);      
    j.setMapperClass(AdjMapper.class);
    j.setReducerClass(AdjReducer.class); 
    FileOutputFormat.setOutputPath(j, new Path(args[1]));
    FileInputFormat.addInputPath(j, new Path(args[0]));
    j.waitForCompletion(true);

This is what I did:

  SparkConf conf=new SparkConf().setAppName("Adjacency Generator Job").setMaster("local[*]");
  JavaSparkContext sc = new JavaSparkContext(conf); 
  JavaRDD<String> infile = sc.textFile(args[0]);
  JavaPairRDD<LongWritable, Text> pair = infile.mapToPair(new PairFunction<String, LongWritable, Text>() {
      private static final long serialVersionUID = 1L;
      @Override
      public Tuple2<LongWritable, Text> call(String s) {
          return new Tuple2<LongWritable, Text>(new LongWritable(), new Text(s));
      }  
  });
   pair.saveAsHadoopFile(args[1], LongWritable.class, Text.class,
          SequenceFileOutputFormat.class);

I am not sure how to add my Mapper and Reducer classes. Any help is appreciated.

Answer:

You can re-use the Hadoop Input/Output format classes in Spark, but you cannot re-use the Mapper and Reducer classes themselves; their logic has to be rewritten as Spark transformations.
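As a rough sketch of what that looks like (the `extractKey`, `extractValue` and `merge` helpers below are hypothetical placeholders for whatever `AdjMapper` and `AdjReducer` actually do, and the key/value types are only an assumption), you keep `TextInputFormat` for reading and move the map/reduce bodies into `mapToPair`/`reduceByKey`:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    SparkConf conf = new SparkConf().setAppName("Adjacency Generator Job").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // The Hadoop InputFormat can be re-used directly: keys are byte offsets,
    // values are lines, exactly as in the mapreduce job.
    JavaPairRDD<LongWritable, Text> lines = sc.newAPIHadoopFile(
            args[0], TextInputFormat.class, LongWritable.class, Text.class, new Configuration());

    // The mapper body becomes a mapToPair (or flatMapToPair) transformation...
    JavaPairRDD<String, String> mapped = lines.mapToPair(
            kv -> new Tuple2<>(extractKey(kv._2.toString()), extractValue(kv._2.toString())));

    // ...and the reducer body becomes a reduceByKey (or groupByKey) transformation.
    JavaPairRDD<String, String> reduced = mapped.reduceByKey((a, b) -> merge(a, b));

    reduced.saveAsTextFile(args[1]);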


SparkContext is the old way to write Spark jobs; SparkSession should be used instead.
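Creating one looks roughly like this (the appName and master values are just placeholders matching the original job):

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
            .appName("Adjacency Generator Job")
            .master("local[*]")
            .getOrCreate();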

The .textFile() reader covers what the default mapreduce <LongWritable, Text> input gives you, minus the keys (which are byte offsets, not line numbers, and you rarely need them anyway), so you just get lines of text as a Dataset of strings.

For example,

// Read, tokenize, and normalize the input lines.
Dataset<String> filtered = spark.read().textFile(sparkHome + "/README.md")
        .flatMap(line -> Arrays.asList(line.split("\\W+")).iterator(), Encoders.STRING())
        .map(word -> word.toLowerCase().trim(), Encoders.STRING());

// Count the words, ordered by descending count.
Dataset<Row> df = filtered
        .map(word -> new Tuple2<>(word, 1L), Encoders.tuple(Encoders.STRING(), Encoders.LONG()))
        .toDF("word", "count")
        .groupBy("word")
        .sum("count")
        .orderBy(new Column("sum(count)").desc())
        .withColumnRenamed("sum(count)", "_cnt");

df.show(35, false);
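If you also need the equivalent of FileOutputFormat.setOutputPath, the result can be written back out through the DataFrameWriter; for example (the output path is just a placeholder):

    df.write().csv(args[1]);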