This is a quick post on removing HTML tags using the stringr package in R.

My purpose here is to take some raw data, which can include HTML markup, and prepare it for a vectorizer.  I don’t need the resulting output to look pretty; I just want to get rid of the HTML tags.

descr <- "<div>This is a <b>Tag</b> page. <p align=\"true\">Something.  Something else. <span>Tag</span>.</div><div>8 < 3</div><div>14 <> 9 </div>"
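# Step 1: replace anything that looks like a tag (text between < and >) with a space.
descr <- stringr::str_replace_all(descr, '<[^<>]+?>', ' ')
# Step 2: collapse runs of two or more spaces into a single space.
descr <- stringr::str_replace_all(descr, ' {2,}', ' ')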
descr
# Expected results:
# " This is a Tag page. Something. Something else. Tag . 8 < 3 14 <> 9 "

I have a two-step process here. In the first step, I’m replacing any run of text between angle brackets with a single space. The second replacement collapses runs of two or more spaces into one. A lone < or an empty <> is left alone, as the last two divs in the example show. I don’t do any checking for correctness, so there is a chance that fake tags (text that merely looks like a tag but ought to be part of the document) don’t make it past the doorman. That said, it works well enough when you don’t need perfection.
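
If you want to run the same two steps over a whole column of text before handing it to a vectorizer, a small wrapper is one way to do it. This is only a sketch: the name strip_html and the final str_trim call are my own assumptions, not part of the code above, and the sample strings are made up for illustration.

```r
library(stringr)

# Hypothetical helper; the name and the trimming step are assumptions.
strip_html <- function(x) {
  x <- str_replace_all(x, "<[^<>]+?>", " ")  # tag-like spans become a space
  x <- str_replace_all(x, " {2,}", " ")      # collapse runs of spaces
  str_trim(x)                                # drop leading/trailing spaces
}

# str_replace_all is vectorized, so a character vector works directly.
docs <- c("<p>Hello <b>world</b></p>", "8 < 3", NA)
cleaned <- strip_html(docs)
cleaned
# "Hello world" "8 < 3" NA

# The "fake tag" caveat: text that merely looks like a tag is removed too.
strip_html("x <y and y> z")
# "x z"
```

The last call shows the lack of correctness checking in practice: "<y and y>" is not HTML, but it sits between angle brackets, so it goes away with the real tags.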
