Ingesting SEC disclosures for algorithmic natural language processing (NLP) is difficult because the HTML is poorly formed. Now Calcbench API users can access standardized disclosure HTML.
For instance, Microsoft's Contingencies note looks like this -
but the HTML looks like this -
Everything is a paragraph, there is no hierarchy, and the headers are not marked up as headers.
Calcbench's standardized HTML looks like this -
The hierarchy of headers is correct, and each header sits in a section with the text to which it refers.
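To illustrate why this matters for NLP, here is a minimal sketch using only Python's standard library. Both HTML snippets are made-up stand-ins, not actual SEC or Calcbench output, and the `OutlineParser` class and `outline` function are hypothetical names introduced for this example. The sketch assumes the standardized HTML uses ordinary `h1` to `h6` header tags; the exact tags Calcbench emits may differ.

```python
from html.parser import HTMLParser

# Made-up example of "flat" filing HTML: headers are just paragraphs.
FLAT_HTML = """
<p>Contingencies</p>
<p>Patent Claims</p>
<p>Text about patent claims.</p>
"""

# Made-up example of sectioned HTML with real header tags.
# This is an assumption about the general shape, not Calcbench's actual output.
SECTIONED_HTML = """
<section>
  <h1>Contingencies</h1>
  <section>
    <h2>Patent Claims</h2>
    <p>Text about patent claims.</p>
  </section>
  <section>
    <h2>Other Matters</h2>
    <p>Text about other matters.</p>
  </section>
</section>
"""


class OutlineParser(HTMLParser):
    """Collect each header with its level and the text that follows it."""

    HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.sections = []  # one dict per header: level, title, text
        self._in_header = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADER_TAGS:
            self.sections.append({"level": int(tag[1]), "title": "", "text": ""})
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in self.HEADER_TAGS:
            self._in_header = False

    def handle_data(self, data):
        if not self.sections:
            return  # ignore text that appears before the first header
        key = "title" if self._in_header else "text"
        self.sections[-1][key] += data


def outline(html):
    """Return (level, title, text) tuples, with whitespace normalized."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return [
        (s["level"], " ".join(s["title"].split()), " ".join(s["text"].split()))
        for s in parser.sections
    ]


flat_outline = outline(FLAT_HTML)
sectioned_outline = outline(SECTIONED_HTML)
```

With the flat input the parser finds no headers at all, so there is nothing to attach the text to. With the sectioned input each header comes back with its level and its own text, which is the structure a downstream NLP pipeline can use directly.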
To get the standardized HTML, use the disclosure API (Calcbench API access required) and pass standardized=True to the DisclosureSearchResults objects returned by the disclosure_search method. See the documentation.