
php – How to get HTML content text of a Wikipedia Page (via Wikipedia API)?

Posted by: admin, July 12, 2020

Questions:

I just want to get the content (no links, no categories, no images… just text).

Answers:

There is no way to get “just the text” from the Wikipedia API. You can download either the HTML of the page or its wikitext. For the HTML, if you go through index.php rather than api.php, use action=render to avoid downloading all the skin content. For the wikitext, use the API or pass action=raw to index.php. Either way, you will then have to parse the result yourself to remove the bits you don’t want to keep.
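As a rough sketch of the index.php route, the snippet below builds the two URLs mentioned above. The helper name `wikiIndexUrl`, the English Wikipedia base URL, and the example title are assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: builds an index.php URL for a page title.
// $action is 'render' (page HTML without the skin) or 'raw' (wikitext).
// The default base URL assumes English Wikipedia.
function wikiIndexUrl(
    string $title,
    string $action,
    string $base = 'https://en.wikipedia.org/w/index.php'
): string {
    // http_build_query() URL-encodes the parameter values.
    return $base . '?' . http_build_query(['title' => $title, 'action' => $action]);
}

$renderUrl = wikiIndexUrl('PHP', 'render'); // HTML of the page body
$rawUrl    = wikiIndexUrl('PHP', 'raw');    // wikitext source

// Fetching requires network access and allow_url_fopen; sending a
// descriptive User-Agent is assumed to be good practice here.
// $context = stream_context_create(['http' => ['user_agent' => 'ExampleFetcher/0.1']]);
// $html = file_get_contents($renderUrl, false, $context);
```

The fetch itself is left commented out so the snippet runs without a network connection; cURL would work just as well.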

In the HTML output, MediaWiki is generally good about adding classes to the interface elements you might want to filter out. Templates and other markup created by users are less consistent (e.g. the hack for table sorting just puts some text in a display:none span, with no class).
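One possible way to do this filtering in PHP is with the bundled DOM extension, sketched below. The function name `htmlToText` is hypothetical, and the class names passed to it (`reference`, `mw-editsection`) are only examples; which classes to drop depends on the page. The extra display:none check is an assumption aimed at class-less hidden spans like the sort-key hack.

```php
<?php
// Hypothetical helper: removes elements carrying any of the given classes
// (plus inline display:none elements), then returns the remaining text.
function htmlToText(string $html, array $dropClasses): string
{
    libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    // The meta tag hints UTF-8 to libxml, which otherwise assumes Latin-1.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Elements hidden inline, which may have no class at all.
    $queries = ['//*[contains(translate(@style, " ", ""), "display:none")]'];
    foreach ($dropClasses as $class) {
        // Match whole class tokens only; class names are assumed to be trusted input.
        $queries[] = sprintf('//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]', $class);
    }

    foreach ($queries as $query) {
        // Copy the node list first so removal does not disturb iteration.
        foreach (iterator_to_array($xpath->query($query)) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Collapse runs of whitespace into single spaces.
    return trim((string) preg_replace('/\s+/', ' ', $doc->documentElement->textContent));
}

$sample = '<p>Alan Turing<sup class="reference">[1]</sup> was a mathematician.'
        . '<span style="display: none">Turing, Alan</span>'
        . '<span class="mw-editsection">[edit]</span></p>';
$text = htmlToText($sample, ['reference', 'mw-editsection']);
```

Text inside tables, captions, and similar blocks would still come through; those need their own rules.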

To get the wikitext via the API, use action=query with prop=revisions (and rvprop=content). To get the rendered HTML, use action=parse.
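For illustration, the two API requests might be built like this. The helper name `wikiApiUrl`, the English Wikipedia endpoint, the example title, and the choice of `format=json` with `formatversion=2` are assumptions made for this sketch.

```php
<?php
// Hypothetical helper: builds an api.php URL from query parameters.
// JSON output with formatversion=2 is an assumption made for this sketch.
function wikiApiUrl(
    array $params,
    string $endpoint = 'https://en.wikipedia.org/w/api.php'
): string {
    return $endpoint . '?' . http_build_query($params + ['format' => 'json', 'formatversion' => 2]);
}

// Wikitext of the latest revision.
$wikitextUrl = wikiApiUrl([
    'action'  => 'query',
    'prop'    => 'revisions',
    'titles'  => 'PHP',
    'rvprop'  => 'content',
    'rvslots' => 'main',
]);

// Rendered HTML of the page.
$htmlUrl = wikiApiUrl([
    'action' => 'parse',
    'page'   => 'PHP',
    'prop'   => 'text',
]);
```

After fetching and running the response through `json_decode($body, true)`, the HTML from the second request should sit under `['parse']['text']` in this output format. Checking the decoded array against a live response before relying on that path is advisable.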