java – Lucene 8.4.1 – LatLonShape.createIndexableFields vs RecursivePrefixTreeStrategy.createIndexableFields-Exceptionshub

Posted by: admin February 25, 2020

Questions:

I’m working with Lucene 8.4.1 and have some questions about spatial indexing.
They concern indexing performance and, later on, spatial searching. My test data consists of about 10000 polygons, which is the small data set.

First of all my setup:

            // JtsSpatialContext is needed to index polygons
            this.ctx = JtsSpatialContext.GEO;
            SpatialPrefixTree tree = new GeohashPrefixTree(this.ctx, MAX_LEVEL);
            this.strategy = new RecursivePrefixTreeStrategy(tree, GEOMETRY_FIELDNAME);
            this.shapeReader = this.ctx.getFormats().getWktReader();
            // Creating the path for lucene index files
            Path path = Paths.get(INDEX_FOLDER);
            this.dir = SimpleFSDirectory.open(path);

            // preparing IndexWriter
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());           
            config.setOpenMode(OpenMode.CREATE);
            config.setRAMBufferSizeMB(256.0);
            config.setUseCompoundFile(false);           
            config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);

            LogMergePolicy policy = new LogDocMergePolicy();
            policy.setMergeFactor(15);
            config.setMergePolicy(policy);

            this.indexWriter = new IndexWriter(dir, config);

As you can see, I am using JtsSpatialContext to index my spatial data. The config is still somewhat magic to me; this combination gave me the best results. MAX_LEVEL of the GeohashPrefixTree is set to 11. Also, this.shapeReader = this.ctx.getFormats().getWktReader() is used to get rid of the deprecation warning that is shown when I use ctx.readFromWkt. I looked at SpatialExample.java from the official Lucene GitHub repository.
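
To get a feeling for what MAX_LEVEL = 11 means, the following minimal sketch computes the size of a single cell per level. It assumes that the levels of GeohashPrefixTree correspond to standard geohash lengths (5 bits per character, bits alternating between longitude and latitude, starting with longitude). The class and method names are hypothetical and not part of Lucene, and the cell count is only a rough upper bound for the bounding box of the test polygon; the strategy may stop at a coarser level depending on its precision settings.

```java
public class GeohashCellSize {

    /** Longitude span in degrees of one geohash cell at the given level (assumed standard geohash). */
    static double cellWidthDegrees(int level) {
        int totalBits = 5 * level;          // each base-32 character carries 5 bits
        int lonBits = (totalBits + 1) / 2;  // bits alternate, longitude gets the first one
        return 360.0 / Math.pow(2, lonBits);
    }

    /** Latitude span in degrees of one geohash cell at the given level (assumed standard geohash). */
    static double cellHeightDegrees(int level) {
        int totalBits = 5 * level;
        int latBits = totalBits / 2;        // latitude gets the remaining bits
        return 180.0 / Math.pow(2, latBits);
    }

    /** Rough upper bound on the number of cells at this level that touch a bounding box. */
    static long cellsToCoverBox(double widthDeg, double heightDeg, int level) {
        // +1 because the box is usually not aligned to the cell grid
        long cols = (long) Math.ceil(widthDeg / cellWidthDegrees(level)) + 1;
        long rows = (long) Math.ceil(heightDeg / cellHeightDegrees(level)) + 1;
        return cols * rows;
    }

    public static void main(String[] args) {
        // Bounding box of the test polygon used in the question
        double width = 9.0843574 - 9.0842201;
        double height = 48.8033461997411 - 48.803237199741126;

        for (int level = 7; level <= 11; level++) {
            System.out.printf("level %d: cell %.3e x %.3e deg, at most %d cells%n",
                    level, cellWidthDegrees(level), cellHeightDegrees(level),
                    cellsToCoverBox(width, height, level));
        }
    }
}
```

The number of cells grows quickly with each additional level, which may be one factor in the indexing cost of a prefix-tree strategy for polygons; this is an estimate, not a measurement.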

Now, as I said, I want to index ~10000 polygons, which in my use case is the small data set. I have two approaches for indexing this data, referred to as Case A and Case B.

Here is how I add those polygons to my index:

        // Start Case A
        List<String> testDataCaseA = new ArrayList<>();     
        for (int i = 0; i < 10000; i++) {
            testDataCaseA.add("POLYGON((9.0842201 48.80324419974113,9.084344 48.803237199741126,9.0843574 48.80333909974109,9.0842334 48.8033461997411,9.0842201 48.80324419974113))");
        }

        long startCaseA = System.nanoTime();

        testDataCaseA.parallelStream().forEach(current -> {
            try {
                this.indexWriter.addDocument(createDocumentCaseA(current));
            } catch (InvalidShapeException | IOException | java.text.ParseException e) {
                logger.error(e.toString());
            }
        });

        double elapsedTimeCaseA = (System.nanoTime() - startCaseA) / 1000000;
        logger.trace("Elapsed Time: " + elapsedTimeCaseA + "ms");
        // End Case A

        // Deleting the index
        this.indexWriter.deleteAll();

        // Start Case B
        List<String> testDataCaseB = new ArrayList<>(); 
        for (int i = 0; i < 10000; i++) {
            testDataCaseB.add("{\"type\":\"Polygon\",\"coordinates\":[[[9.0842201,48.80324419974113],[9.084344,48.803237199741126],[9.0843574,48.80333909974109],[9.0842334,48.8033461997411],[9.0842201,48.80324419974113]]]}");
        }

        long startCaseB = System.nanoTime();

        testDataCaseB.parallelStream().forEach(current -> {
            try {
                this.indexWriter.addDocument(createDocumentCaseB(current));
            } catch (java.text.ParseException | IOException e) {
                logger.error(e.toString());
            }
        });

        double elapsedTimeCaseB = (System.nanoTime() - startCaseB) / 1000000;
        logger.trace("Elapsed Time: " + elapsedTimeCaseB + "ms");
        // End Case B
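
A note on the timing code: (System.nanoTime() - start) / 1000000 is a long division, so the fraction is dropped before the value is assigned to the double (which is why the reported times end in ".0"). A minimal sketch of a helper that keeps fractional milliseconds follows; the class and method names are hypothetical, and the workload in main is only a stand-in for the indexing loop.

```java
public class ElapsedTime {

    /** Converts a nanosecond interval to fractional milliseconds. */
    static double elapsedMillis(long startNanos, long endNanos) {
        // Dividing by a double literal avoids truncating integer division
        return (endNanos - startNanos) / 1_000_000.0;
    }

    /** Runs the task once and returns its wall-clock duration in milliseconds. */
    static double timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return elapsedMillis(start, System.nanoTime());
    }

    public static void main(String[] args) {
        double ms = timeMillis(() -> {
            // Stand-in workload; in the question this would be the indexing loop
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        });
        System.out.println("Elapsed Time: " + ms + "ms");
    }
}
```

A single cold run like this also includes JIT warm-up, so for a fair comparison it can help to repeat each case a few times and compare the later runs.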

And here are the createDocumentCaseA and createDocumentCaseB methods:

    private Document createDocumentCaseA(String geom) throws java.text.ParseException, InvalidShapeException, IOException {
        Document doc = new Document();

        for (Field f : strategy.createIndexableFields(this.shapeReader.read(geom))) {
            doc.add(f);
        }

        return doc;
    }

    private Document createDocumentCaseB(String geom) throws java.text.ParseException {
        Document doc = new Document();

        for (Polygon poly : Polygon.fromGeoJSON(geom)) {
            for (Field f : LatLonShape.createIndexableFields(GEOMETRY_FIELDNAME, poly)) {
                doc.add(f);
            }
        }

        return doc;
    }

The difference between these two variants is astonishing:

Case A Elapsed Time: 41522.0ms

Case B Elapsed Time: 168.0ms

Well, I thought: “Hm, okay, then I just choose Case B and all is fine”.
But my problem is one of understanding: what is the “right way” of doing this? Also, for spatial search, in Case A I have the method strategy.makeQuery(SpatialArgs), whereas in Case B I need to use one of the LatLonShape query factories, such as LatLonShape.newBoxQuery(...) or LatLonShape.newPolygonQuery(...).

Which approach should I choose? Am I missing something in the Lucene docs?

Answers: