This is a quick post today on removing HTML tags using the stringr package in R.

My purpose here is to take some raw data, which can include HTML markup, and prepare it for a vectorizer.  I don’t need the resulting output to look pretty; I just want to get rid of the HTML tags.

descr <- "<div>This is a <b>Tag</b> page. <p align=\"true\">Something.  Something else. <span>Tag</span>.</div><div>8 < 3</div><div>14 <> 9 </div>"
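# Step 1: replace anything of the form <...> with a single space.
descr <- stringr::str_replace_all(descr, '<[^<>]+?>', ' ')
# Step 2: collapse runs of two or more spaces into one space.
descr <- stringr::str_replace_all(descr, ' {2,}', ' ')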
descr
# Expected results:
# " This is a Tag page. Something. Something else. Tag . 8 < 3 14 <> 9 "

I have a two-step process here. In the first step, I’m replacing any run of text between angle brackets with a single space. The second replacement collapses each run of two or more spaces into a single space. I don’t check whether the text between the brackets is a real HTML tag, so there is a chance that fake tags which ought to be part of the document don’t make it past the doorman. That said, it works well enough when you don’t need perfection.
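
To make that caveat concrete, here is a minimal sketch that wraps the same two replacements in a helper. The name strip_html is my own for illustration and is not part of stringr. The second call shows ordinary text that happens to contain a matched pair of angle brackets being swallowed as if it were a tag.

```r
# Hypothetical helper (the name is an assumption, not a stringr function)
# wrapping the two replacements from the post.
strip_html <- function(x) {
  # Step 1: replace anything of the form <...> with a single space.
  x <- stringr::str_replace_all(x, "<[^<>]+?>", " ")
  # Step 2: collapse runs of two or more spaces into one space.
  stringr::str_replace_all(x, " {2,}", " ")
}

# Real markup is stripped, leaving single spaces where the tags were.
strip_html("<p>Hello <b>world</b></p>")
# " Hello world "

# A "fake tag": everything from "<" to ">" is removed, including "b and c".
strip_html("if a < b and c > d")
# "if a d"
```

If input like the second case matters for your data, a real HTML parser is probably the safer choice than a regular expression.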
