My HTML fetcher program in Java returns incomplete results

Posted by: admin, November 1, 2017

Questions:

My Java code is:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class celebGrepper {

    static class CelebData {
        URL link;
        String name;

        CelebData(URL link, String name) {
            this.link=link;
            this.name=name;
        }
    }

    public static String grepper(String url) {
        URL source;
        String data = null;

        try {
            source = new URL(url);
            HttpURLConnection connection = (HttpURLConnection) source.openConnection();
            connection.connect();

            InputStream is = connection.getInputStream();

            /*
             * Read a whole line at a time instead of one character at a time.
             */
            StringBuilder str = new StringBuilder();
            // try-with-resources closes the reader (and the underlying stream) when done
            try (BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
                while ((data = br.readLine()) != null)
                    str.append(data);
            }

            data=str.toString();

        } catch (IOException e) {
            e.printStackTrace();
        }

        return data;
    }

    public static ArrayList<CelebData> parser(String html) throws MalformedURLException {
        ArrayList<CelebData> list = new ArrayList<CelebData>();

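        // Regex backslashes are doubled for the Java string literal; lazy quantifiers (*?) keep each match within one row.
        Pattern p = Pattern.compile("<td class=\"image\".*?<img src=\"(.*?)\"[\\s\\S]*?<td class=\"name\"><a.*?>([\\w\\s]+)</a>");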
        Matcher m = p.matcher(html);

        while(m.find()) {
            CelebData current = new CelebData(new URL(m.group(1)),m.group(2));
            list.add(current);
        }

        return list;
    }

    public static void main(String... args) throws MalformedURLException {
        String html = grepper("https://www.forbes.com/celebrities/list/");
        System.out.println("RAW Input: "+html);
        System.out.println("Start Grepping...");
        ArrayList<CelebData> celebList = parser(html);
        for(CelebData item: celebList) {
            System.out.println("Name:\t\t "+item.name);
            System.out.println("Image URL:\t "+item.link+"\n");
        }
        System.out.println("Grepping Done!");
    }

}

It’s supposed to fetch the entire HTML content of https://www.forbes.com/celebrities/list/. However, when I compare the actual result (below) with the original page, the entire table that I need is missing! Is it because the page isn’t completely loaded when I start reading bytes from the input stream? Please help me understand.
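One way to narrow this down, as a rough sketch: HttpURLConnection only returns what the server sends for that URL and does not run any JavaScript, so if the table is built by scripts in the browser (an assumption, not something confirmed here), its markup will never appear in the fetched string. The class and method names below are hypothetical, and the two sample strings are made up for illustration.

```java
public class TableMarkupCheck {

    // Hypothetical helper: true if the raw HTML already contains
    // the cell markup that the regex in parser() expects.
    static boolean hasTableMarkup(String html) {
        return html != null && html.contains("<td class=\"name\"");
    }

    public static void main(String[] args) {
        // Made-up sample: the table is present in the served HTML.
        String staticPage =
            "<table><tr><td class=\"name\"><a href=\"#\">Some Name</a></td></tr></table>";
        // Made-up sample: only a placeholder and a script tag are served.
        String scriptedPage =
            "<div id=\"list\"></div><script src=\"app.js\"></script>";

        System.out.println(hasTableMarkup(staticPage));   // prints "true"
        System.out.println(hasTableMarkup(scriptedPage)); // prints "false"
    }
}
```

Calling the same check on the string returned by grepper() would show whether the table is missing from the response itself or only from the regex matches.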


The Output of the page:

https://jsfiddle.net/e0771aLz/

What can I do to extract just the image links and the names of the celebs?


I know that parsing HTML with regex is extremely bad practice and the stuff of nightmares, but that’s exactly what the instructor did in a video training course for Android, and I just want to follow along since it’s only for this one lesson.
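For reference, here is a minimal sketch of the regex approach on its own, run against a made-up HTML fragment rather than the real page (whose markup is not shown here). In a Java string literal every regex backslash has to be doubled, and lazy quantifiers (`*?`) stop each match at the end of its own row instead of swallowing the whole document. The class name, the `extract` method, and the example.com URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CelebRegexSketch {

    // Group 1: image URL, group 2: name. Assumes rows shaped like the sample below.
    static final Pattern ROW = Pattern.compile(
        "<td class=\"image\".*?<img src=\"(.*?)\"[\\s\\S]*?<td class=\"name\"><a.*?>([\\w\\s]+)</a>");

    // Returns one {imageUrl, name} pair per matched row.
    static List<String[]> extract(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            rows.add(new String[] { m.group(1), m.group(2) });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up fragment for illustration only.
        String html =
            "<tr><td class=\"image\"><img src=\"https://example.com/a.jpg\"></td>"
          + "<td class=\"name\"><a href=\"/a\">Person A</a></td></tr>"
          + "<tr><td class=\"image\"><img src=\"https://example.com/b.jpg\"></td>"
          + "<td class=\"name\"><a href=\"/b\">Person B</a></td></tr>";

        for (String[] row : extract(html)) {
            System.out.println(row[1] + " -> " + row[0]);
        }
    }
}
```

With greedy quantifiers the same pattern would produce at most one match spanning from the first image cell to the last name cell, which is why the lazy forms are used in this sketch.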

Answers: