Indexing text files using HashTables and JDBC stops after 15 mins

Posted by: admin June 30, 2018

Questions:

At my university we were assigned to build a search engine based on the Vector Space Model. We have to index around 600 plain-text files and load their words into a hash table following the model.
We designed an algorithm that stores three columns in a database: filename, word, frequency. The first two columns together form the primary key.
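
As a rough sketch of that table layout, the snippet below declares a composite primary key on (filename, word) and inserts all rows of one file through a batched `PreparedStatement`. The table name `posting`, the column types, and the class `PostingTable` are assumptions made for illustration; they are not taken from the project described here, and any JDBC `Connection` could be passed in.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Map;

public class PostingTable {
    // Hypothetical table and column names; the primary key spans (filename, word).
    public static final String CREATE_SQL =
        "CREATE TABLE posting ("
        + "filename VARCHAR(255) NOT NULL, "
        + "word VARCHAR(255) NOT NULL, "
        + "frequency INTEGER NOT NULL, "
        + "PRIMARY KEY (filename, word))";

    public static final String INSERT_SQL =
        "INSERT INTO posting (filename, word, frequency) VALUES (?, ?, ?)";

    // Creates the table on the given connection.
    public static void createTable(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate(CREATE_SQL);
        }
    }

    // Sends every (word, frequency) pair of one file to the database as a single batch.
    public static void insertFile(Connection conn, String filename,
                                  Map<String, Integer> counts) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                ps.setString(1, filename);
                ps.setString(2, e.getKey());
                ps.setInt(3, e.getValue());
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}
```

Batching per file is only one possible design; it keeps the number of round trips to the database at roughly one per file instead of one per word.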

This is the code that does what I have just explained (sorry if some comments are in Spanish; I can translate them if you don’t understand the algorithm). It is executed once per file:

Basically, the first while loop builds a hash table for each file, storing each word and its frequency. The second while loop iterates over that per-file hash table and updates the main hash table, which holds the data for all files.

        String regex = "[^a-zA-ZñÑá-úÁ-Ú\']";
        Scanner scanner = new Scanner(file,"ISO-8859-1").useDelimiter(regex);
        String aux[];
        DBPosteo dbp = new DBPosteo();
        dbp.init();

        TSB_OAHashtable<String,Integer> temphash = new TSB_OAHashtable<>(); // temporary hash table for this file

        while(scanner.hasNext())
        {                
            aux = scanner.nextLine().split(regex); // text line
            for(String st : aux) // we fill file's hash table
            {
                st = st.toLowerCase().replace("æ", "ae"); // st = word
                if(!st.equals(""))
                {
                    if(temphash.containsKey(st)) // if we find the word
                    {
                        temphash.put(st, temphash.get( st ) + 1); // increment the frequency
                    }
                    else // if the word is not in the temporary hash table
                    {
                        temphash.put(st, 1);
                    }
                } 
            }
        }

        Set<Map.Entry<String,Integer>> se = temphash.entrySet();
        Iterator<Map.Entry<String,Integer>> it = se.iterator();

        while(it.hasNext()) // we iterate the temporary hash table of this text file
        {
            Entry<String,Integer> x = it.next();
            String termino = x.getKey();


            if (hash.contains(termino)) // if the main hash table has this word
            {
                DatosTermino dt = hash.get(termino); // then we obtain the data for the vector model (dt) for this word
                dt.setNr(dt.getNr() + 1); // part of the data we need... Nr is the amount of files where the word appears
                if (x.getValue() > dt.getMaxTf()) // we update the maximum term frequency (maxTf)
                {
                    dt.setMaxTf(x.getValue()); 
                }
            }
            else // if the word is not in the vocabulary
            {
                DatosTermino dt = new DatosTermino();
                hash.put(termino, dt);
            }

            DatosPosteo dp = new DatosPosteo(file.getName(), x.getValue()); // object that represents part of the data that goes to the DB
            dbp.insertarPosteo(termino, dp); // THIS INSERTS THE 3 COLUMNS I MENTIONED ABOVE
        }

        dbp.finalizar();
        OAHashtableWriter htw = new OAHashtableWriter(PATHVOCABULARIO);
        htw.write( hash );
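
For comparison, the two loops above can be sketched with the standard collections. This is a minimal illustration under stated assumptions, not the code from the question: `java.util.HashMap` stands in for `TSB_OAHashtable`, the class `TermData` is a hypothetical replacement for `DatosTermino`, and a new term is assumed to start with Nr = 1 and maxTf equal to its frequency in the current file. The regex is the same character class as above, written with Unicode escapes and a trailing `+`.

```java
import java.util.HashMap;
import java.util.Map;

public class VocabularySketch {
    // Hypothetical stand-in for DatosTermino.
    public static class TermData {
        public int nr;     // number of files in which the term appears
        public int maxTf;  // highest frequency of the term in any single file
        public TermData(int nr, int maxTf) { this.nr = nr; this.maxTf = maxTf; }
    }

    // Same character class as the question's regex (letters, n-tilde, accented
    // range, apostrophe), with '+' so that runs of separators split only once.
    static final String DELIMITER =
        "[^a-zA-Z\u00f1\u00d1\u00e1-\u00fa\u00c1-\u00da']+";

    // First loop: count how often each word occurs in one file's text.
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.split(DELIMITER)) {
            String word = token.toLowerCase().replace("\u00e6", "ae");
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum); // insert 1 or add 1
            }
        }
        return counts;
    }

    // Second loop: fold one file's counts into the global vocabulary.
    public static void mergeIntoVocabulary(Map<String, TermData> vocabulary,
                                           Map<String, Integer> fileCounts) {
        for (Map.Entry<String, Integer> e : fileCounts.entrySet()) {
            TermData data = vocabulary.get(e.getKey()); // lookup by key
            if (data == null) {
                vocabulary.put(e.getKey(), new TermData(1, e.getValue()));
            } else {
                data.nr++;
                data.maxTf = Math.max(data.maxTf, e.getValue());
            }
        }
    }
}
```

The sketch looks terms up by key with `get`. In `java.util.Hashtable`, `contains(Object)` tests values rather than keys, while `containsKey` tests keys; whether `TSB_OAHashtable` follows the same convention is not shown in the question.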

We have some console prints to track the progress. Indexing starts normally, but after about 15 minutes it stops indexing words: the word counter stops increasing, yet the program keeps “indexing” the remaining files without adding any new words.

Thank you VERY MUCH in advance…

Answers: