Indexing text files using HashTables and JDBC stops after 15 mins

Posted by: admin June 30, 2018


At my university we were asked to code a search engine using the Vector Space Model. We have to index around 600 plain-text files and insert their words into a hash table following the model.
We designed an algorithm that saves three columns in a database: filename, word, frequency. The first two columns together form the primary key.
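For illustration only, here is a minimal sketch of how such a three-column table could be created and filled through plain JDBC. The table name `posting`, the column types, and the class and method names are all assumptions made for this sketch; they are not taken from the `DBPosteo` class used in the question, whose internals are not shown.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Map;

// Hypothetical helper; names and column types are assumptions, not the question's DBPosteo.
final class PostingTableSketch {
    // (filename, word) is declared as a composite primary key.
    static final String CREATE_TABLE =
        "CREATE TABLE posting ("
        + "filename VARCHAR(255) NOT NULL, "
        + "word VARCHAR(255) NOT NULL, "
        + "frequency INTEGER NOT NULL, "
        + "PRIMARY KEY (filename, word))";

    static final String INSERT =
        "INSERT INTO posting (filename, word, frequency) VALUES (?, ?, ?)";

    static void createTable(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate(CREATE_TABLE);
        }
    }

    /** Writes all (word, frequency) pairs of one file as one batch inside one transaction. */
    static void insertFile(Connection conn, String filename, Map<String, Integer> tf)
            throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(INSERT)) {
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                ps.setString(1, filename);
                ps.setString(2, e.getKey());
                ps.setInt(3, e.getValue());
                ps.addBatch(); // queue the row instead of sending it immediately
            }
            ps.executeBatch();
            conn.commit();
        } catch (SQLException ex) {
            conn.rollback(); // discard the partial file on failure
            throw ex;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }
}
```

Grouping one file's rows into a single batch and transaction is a common way to reduce per-row overhead. Whether it has any bearing on the stall described in this question is not something the sketch establishes.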

This is the code that does what I have just explained (sorry if the comments are in Spanish, I can translate if you don’t understand the algorithm). It is executed once per file:

Basically, the first while loop builds a hash table for the current file, storing each word and its frequency. The second while loop iterates over that per-file hash table and updates the main hash table, which holds the data for all files.

        String regex = "[^a-zA-ZñÑá-úÁ-Ú\']";
        Scanner scanner = new Scanner(file,"ISO-8859-1").useDelimiter(regex);
        String[] aux;
        DBPosteo dbp = new DBPosteo();

        TSB_OAHashtable<String,Integer> temphash = new TSB_OAHashtable<>(); // temporary hash table for this file

        while (scanner.hasNextLine()) {
            aux = scanner.nextLine().split(regex); // words of one text line
            for (String st : aux) { // fill this file's hash table
                if (st.isEmpty()) continue; // split() yields empty tokens between adjacent separators
                st = st.toLowerCase().replace("æ", "ae"); // st = word
                if (temphash.containsKey(st)) // the word was already seen in this file
                    temphash.put(st, temphash.get(st) + 1); // +1 to its frequency
                else // first occurrence of the word in this file
                    temphash.put(st, 1);
            }
        }
        scanner.close(); // release the file handle before moving on to the next file

        Set<Map.Entry<String,Integer>> se = temphash.entrySet();
        Iterator<Map.Entry<String,Integer>> it = se.iterator();

        while (it.hasNext()) { // iterate over this file's temporary hash table
            Map.Entry<String,Integer> x = it.next();
            String termino = x.getKey();

            // containsKey, as in the first loop; if TSB_OAHashtable follows
            // java.util.Hashtable, contains() would test values, not keys
            if (hash.containsKey(termino)) { // the main hash table already has this word
                DatosTermino dt = hash.get(termino); // vector-model data (dt) for this word
                dt.setNr(dt.getNr() + 1); // Nr is the number of files in which the word appears
                if (x.getValue() > dt.getMaxTf()) // update the maximum term frequency
                    dt.setMaxTf(x.getValue()); // setter assumed to mirror getMaxTf()
            } else { // the word is not in the vocabulary yet
                DatosTermino dt = new DatosTermino();
                hash.put(termino, dt);
            }

            DatosPosteo dp = new DatosPosteo(file.getName(), x.getValue()); // object that represents part of the data that goes to the DB
            dbp.insertarPosteo(termino, dp); // THIS INSERTS THE 3 COLUMNS I MENTIONED ABOVE
        }

        OAHashtableWriter htw = new OAHashtableWriter(PATHVOCABULARIO);
        htw.write( hash );
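To make the two loops easier to follow in isolation, here is a minimal, self-contained sketch of the same idea using `java.util.HashMap` in place of `TSB_OAHashtable`. The class `IndexSketch`, the `TermStats` holder (a stand-in for `DatosTermino`), and the method names are hypothetical. The separator pattern uses the Unicode letter class `\p{L}` rather than the explicit ranges in the code above, which is a swapped-in simplification and may tokenize slightly differently.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not the classes used in the question.
final class IndexSketch {
    /** Per-term data kept in the global vocabulary (stand-in for DatosTermino). */
    static final class TermStats {
        int nr;    // number of files that contain the term
        int maxTf; // highest frequency of the term within any single file
    }

    // Anything that is not a letter or an apostrophe separates words;
    // the '+' collapses runs of separators into one.
    private static final String SEPARATORS = "[^\\p{L}']+";

    /** First loop: count how often each word occurs in one file's text. */
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String raw : text.split(SEPARATORS)) {
            if (raw.isEmpty()) continue; // a leading separator yields one empty token
            // \u00e6 is the "ae" ligature replaced in the original code
            String word = raw.toLowerCase(Locale.ROOT).replace("\u00e6", "ae");
            tf.merge(word, 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        return tf;
    }

    /** Second loop: fold one file's counts into the global vocabulary. */
    static void mergeIntoVocabulary(Map<String, TermStats> vocabulary, Map<String, Integer> tf) {
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            TermStats stats = vocabulary.computeIfAbsent(e.getKey(), k -> new TermStats());
            stats.nr++; // the term appears in one more file
            stats.maxTf = Math.max(stats.maxTf, e.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, TermStats> vocabulary = new HashMap<>();
        mergeIntoVocabulary(vocabulary, countTerms("La casa y la CASA"));
        mergeIntoVocabulary(vocabulary, countTerms("la, la; la!"));
        TermStats la = vocabulary.get("la");
        System.out.println("la: nr=" + la.nr + ", maxTf=" + la.maxTf); // la: nr=2, maxTf=3
    }
}
```

Here `merge` and `computeIfAbsent` replace the explicit `containsKey`/`put` branches, so each word needs a single lookup per update.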

We have some console prints to check its progress. Indexing starts normally, but after about 15 minutes it stops adding words: the word counter stops increasing, yet the program keeps “indexing” the rest of the files without adding any new word.

Thank you VERY MUCH in advance…