I'm currently working on a project that scrapes the SEC's EDGAR website for Form 4 filings.

I currently have the required data as raw text that looks somewhat like XML. I can't really tell what format it is, but I'm trying to parse this data into JSON.

I haven't parsed anything as complex as this before and would appreciate any pointers on how to go about it.

I have attached a screenshot of one sample.

This link fetches the data of a single filing in text format.

  • 2
    Good luck with that. That's a mixed format.

    The main document follows some sort of custom format. You can see that the header is marked as "SGML", which is just a standard. At the very least this means they should follow the standard, so you can parse the main tags like SEC-DOCUMENT and SEC-HEADER into some sort of SecDocument object.

    Each custom tag also seems to allow additional info on the same line as the tag, which appears to be [filename : ] timestamp.
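
    A rough sketch of reading one of those tag lines in Node.js, assuming the layout really is `<TAG>[filename : ]value`. The function name `parseTagLine` and the sample values are made up for illustration, not taken from a real filing.

    ```javascript
    // Hypothetical helper: splits one wrapper line such as
    // "<SEC-HEADER>sample.hdr.sgml : 20200101" into its parts.
    function parseTagLine(line) {
      const match = /^<([A-Za-z0-9-]+)>(.*)$/.exec(line.trim());
      if (!match) return null; // closing tags and plain text are not handled here
      const [, tag, rest] = match;
      const parts = rest.split(' : ');
      if (parts.length === 2) {
        // Assumed layout: "filename : value"
        return { tag, filename: parts[0].trim(), value: parts[1].trim() };
      }
      // No filename part: the rest of the line (if any) is the value.
      return { tag, filename: null, value: rest.trim() || null };
    }

    console.log(parseTagLine('<SEC-DOCUMENT>sample.txt : 20200101'));
    console.log(parseTagLine('<ACCEPTANCE-DATETIME>20200101120000'));
    ```

    Because the value is taken from the rest of the line, this also copes with tags that never get a closing tag.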

    Additionally, the actual header content is a separate format: an indentation tree plus a tab-separated dictionary. This whole thing should be simple to parse into an object and thus translate into JSON directly.
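
    As a minimal sketch of that idea, the header could be walked line by line with a stack that tracks nesting. This assumes depth is given by the number of leading tabs and that a key with no value opens a nested section; `parseHeader` and the sample lines are hypothetical.

    ```javascript
    // Sketch: turn tab-indented "KEY:<tabs>VALUE" lines into a nested object.
    // Assumptions: leading tabs give the depth; a key without a value opens a section.
    function parseHeader(text) {
      const root = {};
      const stack = [{ depth: -1, node: root }];
      for (const raw of text.split(/\r?\n/)) {
        const trimmed = raw.trim();
        if (!trimmed || trimmed.startsWith('<')) continue; // skip blanks and tag lines
        const sep = raw.indexOf(':');
        if (sep === -1) continue; // not a key/value line
        const depth = raw.match(/^\t*/)[0].length;
        const key = raw.slice(0, sep).trim();
        const value = raw.slice(sep + 1).trim();
        // Pop back to the closest ancestor that is shallower than this line.
        while (stack[stack.length - 1].depth >= depth) stack.pop();
        const parent = stack[stack.length - 1].node;
        if (value === '') {
          const child = {};
          parent[key] = child;
          stack.push({ depth, node: child });
        } else {
          parent[key] = value;
        }
      }
      return root;
    }

    const sampleHeader = [
      'CONFORMED SUBMISSION TYPE:\t4',
      'REPORTING-OWNER:\t',
      '\tOWNER DATA:\t',
      '\t\tCOMPANY CONFORMED NAME:\t\t\tDOE JOHN',
    ].join('\n');
    console.log(JSON.stringify(parseHeader(sampleHeader), null, 2));
    ```

    One known gap in this sketch: a section name that repeats at the same level overwrites the earlier one, so repeated sections would need to be collected into an array instead.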

    Lastly, the actual document can be in any other format they want; your example is XML. Here you will need to check the <TEXT> tag content for a format tag, probably always on the first line after TEXT (e.g. <XML>), and then use the proper format parser to transform that into JSON.
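
    A hedged sketch of that last step: pull each `<DOCUMENT>` block out of the raw text and report which format tag opens its `<TEXT>` body. It assumes the format tag is an upper-case wrapper on the first line after `<TEXT>`; `extractDocuments` is a made-up name.

    ```javascript
    // Sketch: find every <DOCUMENT> block and detect the format wrapper inside <TEXT>.
    function extractDocuments(submission) {
      const docs = [];
      const docPattern = /<DOCUMENT>([\s\S]*?)<\/DOCUMENT>/g;
      for (const [, block] of submission.matchAll(docPattern)) {
        const textMatch = /<TEXT>\r?\n?([\s\S]*?)<\/TEXT>/.exec(block);
        const body = textMatch ? textMatch[1] : '';
        // Assumption: the format tag, if any, wraps the body, e.g. <XML> ... </XML>.
        const wrapper = /^\s*<([A-Z0-9]+)>\r?\n([\s\S]*?)<\/\1>/.exec(body);
        docs.push({
          format: wrapper ? wrapper[1] : 'UNKNOWN',
          content: (wrapper ? wrapper[2] : body).trim(),
        });
      }
      return docs;
    }
    ```

    Node.js has no built-in XML parser, so the `content` of an XML document would then go to an XML-to-object package of your choice (xml2js is one that exists on npm) before serializing with `JSON.stringify`.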
  • 2
    Seems pretty easy. The first part, before the real XML, seems to be irrelevant.

    Then you only parse the real XML and check for documenttype.

    Or you just search the full text without parsing for


    Or am I missing something here?
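
    For the "check the document type without a full parse" shortcut, a plain regex over the raw text might be enough. This is only a sketch: the element name `documentType` is an assumption about the embedded XML, and `findDocumentType` is a made-up helper.

    ```javascript
    // Sketch: read the document type straight from the raw text with a regex,
    // without building an XML tree first.
    function findDocumentType(rawText) {
      const match = /<documentType>\s*([^<]+?)\s*<\/documentType>/.exec(rawText);
      return match ? match[1] : null; // null when the element is absent
    }

    console.log(findDocumentType('<documentType>4</documentType>'));
    ```

    This is fine for filtering filings, but it gives you a single value, not the JSON structure the question asks for.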
  • 2
    So really, you need to implement (or find existing implementations of) at least 3 parsers: one for their custom SGML, one for the tree/object data, and one for whatever format the file is in.

    Unless this is some sort of standardized format that you can find a library for, but I have never seen this one before so I wouldn't know. Ultimately, you should be able to parse it easily if you split your workload into separate parsers and then combine them as needed.
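
    To illustrate the "separate parsers, combined as needed" idea, here is a sketch of a thin coordinator that only splits the file and delegates. The parsers passed in (`parseHeader`, `bodyParsers`) are hypothetical placeholders you would supply yourself, and the tag names are assumed from the sample.

    ```javascript
    // Sketch: one small parser per layer, combined by a thin coordinator.
    function parseSubmission(raw, { parseHeader, bodyParsers }) {
      // Layer 1: the SGML-style wrapper. Skip the rest of the <SEC-HEADER> line itself.
      const headerMatch = /<SEC-HEADER>[^\n]*\n([\s\S]*?)<\/SEC-HEADER>/.exec(raw);
      const documents = [];
      for (const [, block] of raw.matchAll(/<DOCUMENT>([\s\S]*?)<\/DOCUMENT>/g)) {
        const type = (/<TYPE>([^\r\n]*)/.exec(block) || [])[1] || null;
        const text = (/<TEXT>([\s\S]*?)<\/TEXT>/.exec(block) || [])[1] || '';
        const format = (/^\s*<([A-Z0-9]+)>/.exec(text) || [])[1] || 'UNKNOWN';
        // Layer 3: pick a body parser by format, falling back to the raw text.
        const parseBody = bodyParsers[format] || ((s) => s.trim());
        documents.push({ type, format, body: parseBody(text) });
      }
      return {
        // Layer 2: the indented key/value header, handled by its own parser.
        header: headerMatch ? parseHeader(headerMatch[1]) : null,
        documents,
      };
    }
    ```

    The result is a plain object, so `JSON.stringify(parseSubmission(raw, parsers))` would give the JSON directly, and each parser can be tested on its own.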
  • 1
    Looks like some XML-like format with natural-text-like values. Also, some tags such as <acceptance-datetime> are not being closed, which is fun.

    The text section has an embedded XML document, which also is interesting.

    Do you have some more documents so that we can see if the format deviates? Looks fun and rather straightforward to parse :)

    What language are you using, btw?
  • 0
    @ScriptCoded I have checked multiple documents and they are the same. I use Node.js.