Friday, August 14, 2015

Extracting comments from Google Docs

One of the best parts about Google Docs is the whole easy-to-share aspect of it, which includes an excellent commenting system.

Google will happily let you export your document, but this extra info is harder to get hold of. From the drop-down menu you can currently export your file as PDF or RTF (has formatting, lacks comments), TXT (has comments, lacks formatting), or HTML (has both comments and formatting). However, check out what the HTML output looks like:

That's really close to what I want, but it lacks the highlighting to indicate exactly what the comment is referencing. Or does it?

Here I've marked up all the <p> tags with a green border and all the <span> tags in blue. You can see that there are spans wrapping the formatted text, but also where the highlighted text would be. Most notably, see how "obnoxious" is split across two tags.

Unfortunately, there's no markup in the <span>s themselves to indicate which comment they go with. There's the tag/link right after them, which will have to be enough.

<sup><a href="#cmnt1" name="cmnt_ref1">[a]</a></sup>
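Finding those links is the easy part. Here is a minimal sketch using Python's standard html.parser that collects the text of each <span> in order and notes how many spans came before each comment link. The class and function names are my own, the sample HTML is made up to mirror the "obnoxious" case, and it assumes the spans are not nested and the link sits outside them, which may not hold for every export.

```python
from html.parser import HTMLParser


class CommentRefFinder(HTMLParser):
    """Collects span text and the comment reference links that follow it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.spans = []  # text of every <span> seen so far, in document order
        self.refs = []   # (comment anchor, number of spans seen before the link)
        self._in_span = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span":
            # Assumption: spans are flat (never nested inside each other).
            self._in_span = True
            self._buffer = []
        elif tag == "a" and (attrs.get("name") or "").startswith("cmnt_ref"):
            # href looks like "#cmnt1"; drop the leading "#".
            anchor = (attrs.get("href") or "").lstrip("#")
            self.refs.append((anchor, len(self.spans)))

    def handle_data(self, data):
        if self._in_span:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_span:
            self.spans.append("".join(self._buffer))
            self._in_span = False


def find_comment_refs(html_text):
    """Return (spans, refs) for an exported HTML document."""
    finder = CommentRefFinder()
    finder.feed(html_text)
    finder.close()
    return finder.spans, finder.refs


# Made-up sample shaped like the export described above.
sample = (
    '<p><span>This is an </span><span>obnox</span><span>ious</span>'
    '<sup><a href="#cmnt1" name="cmnt_ref1">[a]</a></sup>'
    '<span> example.</span></p>'
)
spans, refs = find_comment_refs(sample)
```

Each entry in refs says "comment cmnt1 comes right after the first three spans", which is the starting point for working backwards.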

But now, how do you tell how many of those preceding span tags, working back from the link, are part of the comment? For that, you gotta lean on the Google Drive API. Specifically, the comments list endpoint.

Look at that! It even has the user who made the comment and the timestamp. All we could ever want. It also indicates there was a reply, which the [b] comment is. The important field it has, though, is context. It doesn't tell us *where* in the document that context is (that's okay, we've got the HTML), but it does tell us how much of the content the comment spans.
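To make that concrete, here is a rough sketch of pulling those fields out of a comments list response once it has been parsed from JSON. The field names (items, commentId, author.displayName, createdDate, content, context.value, replies) are what I understand the Drive API v2 comment resource to use, so treat them as an assumption and check them against the actual response. The sample response is invented for illustration, and summarize_comments is a hypothetical helper name.

```python
def summarize_comments(response):
    """Pull the interesting fields out of a parsed comments list response.

    Assumes Drive API v2-style field names; adjust if your response differs.
    """
    summaries = []
    for item in response.get("items", []):
        summaries.append({
            "id": item.get("commentId"),
            "author": item.get("author", {}).get("displayName"),
            "created": item.get("createdDate"),
            "text": item.get("content"),
            # The quoted document text the comment is attached to.
            "context": item.get("context", {}).get("value"),
            "reply_count": len(item.get("replies", [])),
        })
    return summaries


# Invented sample, shaped like the response described above.
sample_response = {
    "items": [
        {
            "commentId": "example-id",
            "author": {"displayName": "Example User"},
            "createdDate": "2015-08-14T00:00:00.000Z",
            "content": "Is this the right word?",
            "context": {"type": "text/html", "value": "obnoxious"},
            "replies": [{"content": "Probably not."}],
        }
    ]
}
comments = summarize_comments(sample_response)
```

Using .get everywhere is deliberate: a comment without a highlighted range may have no context at all, and the sketch returns None for it rather than blowing up.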

And that's it. Well, the basics at least. From there, it's a bunch of corner cases about character escaping and sticking together the spans the right way so you can match the context string. Haven't ironed all those out yet, myself.
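The basic matching step can be sketched like this: start at the span just before the comment link, keep prepending earlier spans, and stop when the joined text equals the context string. This is a simplified illustration, not the code from my project. It assumes the span texts are already decoded, that the context string may arrive HTML-escaped (so it gets unescaped), and that whitespace differences can be ignored. The function name is made up.

```python
import html


def first_span_for_context(spans, ref_index, context):
    """Find where a comment's highlighted text starts.

    spans:     span texts in document order (already decoded)
    ref_index: how many spans come before the comment link
    context:   the quoted text reported by the API
    Returns the index of the first span in the highlight, or None
    if the spans before the link never add up to the context.
    """
    # Assumption: the context may be HTML-escaped; collapse whitespace too.
    target = " ".join(html.unescape(context).split())
    joined = ""
    for start in range(ref_index - 1, -1, -1):
        joined = spans[start] + joined
        normalized = " ".join(joined.split())
        if normalized == target:
            return start
        if len(normalized) > len(target):
            break  # overshot, so the span boundaries do not line up
    return None


# "obnoxious" is split across two spans, as in the example above.
example_spans = ["This is an ", "obnox", "ious", " example."]
start = first_span_for_context(example_spans, 3, "obnoxious")
highlighted = example_spans[start:3]
```

Requiring an exact match keeps the sketch honest about failure: if the export splits the spans somewhere unexpected, you get None instead of a wrong highlight.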

For some specific examples (and some gnarly code) you can check out some segments of my project on GitHub, specifically grabbing file information, content, and comments and pairing comments with highlighted text in a very brittle way.

If there's a better way to do this, I was unable to find it and would love to hear about it! Seems odd that it was this convoluted... I'm likely missing something.
