Let's talk HTML Fragments in your clipboard when coding on Windows. I've created what's arguably the best Markdown editor on Win10 (if I do say so myself), and wanted it to be more intelligent when I was pasting content that comes from web browsers. That is, in the original version of MarkUpDown, if you copied the text I've got highlighted, below, from DaringFireball...

[Image: quote from Walt Mossberg on Siri plus a little bit from Gruber, now with visible context menu saying "Copy"]

... and you pasted it into MarkUpDown as a quotation (Ctrl-Shift-V), you'd get unstyled text, like this...

> Mossberg:
> For instance, when I asked Siri on my Mac how long it would take me to get to work, it said it didn’t have my work address — even though the “me” contact card contains a work address and the same synced contact card on my iPhone allowed Siri to give me an answer.
> Similarly, on my iPad, when I asked what my next appointment was, it said “Sorry, Walt, something’s wrong” — repeatedly, with slightly different wording, in multiple places on multiple days. But, using the same Apple calendar and data, Siri answered correctly on the iPhone.
> These sort of glaring inconsistencies are almost as bad as universal failures. The big problem Apple faces with Siri is that when people encounter these problems, they stop trying.

I mean, that's okay-ish. Unfortunately, you'd have to go put in an additional line after each line so that it's not all scrunched into one paragraph. And you lose that Gruber was block-quoting Mossberg. And you lose that "they stop trying" was originally in italics. That stinks.

Isn't Markdown really supposed to be shorthand for html? Can't we handle pasting html better than that?

Yes, yes we can...


TL;DR

Go pull the HtmlFragmentHelper, a working, in-progress library for turning HTML Fragment strings into view models with these properties...

public class HtmlFragmentViewModel
{
    public string Version = "";
    public int StartHtml = int.MinValue;
    public int EndHtml = int.MinValue;
    public int StartFragment = int.MinValue;
    public int EndFragment = int.MinValue;
    public string SourceUrl = "";
    public string FragmentSourceRaw = "";
    public string Error = "";

    //...
}

There are also several convenience methods to help you get, say, the top- and second-level domains for the HTML Fragment's source URL (example.com from http://blog.example.com/user/blog0000231.html), or just the html from the selected fragment (myHtmlFragmentViewModel.FragmentSourceParsed), etc.

Here's an example:

public string CreateQuote(string htmlFragmentSource) 
{
    string strIntroLink = string.Empty;
    HtmlFragmentViewModel vm = new HtmlFragmentViewModel(htmlFragmentSource);

    if (vm.SourceUrl.Length > 0 && vm.SourceUrlDomainSecondAndTopLevelsOnly.Length > 0)
    {
        strIntroLink = string.Format("From <a href=\"{0}\">{1}</a>:" + Environment.NewLine + Environment.NewLine,
            vm.SourceUrl, vm.SourceUrlDomainSecondAndTopLevelsOnly);
    }

    return strIntroLink + vm.FragmentSourceParsed;
}
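
How might a property like SourceUrlDomainSecondAndTopLevelsOnly work? Here's a minimal sketch, not the library's actual code: DomainSketch and SecondAndTopLevel are names made up for illustration, and the approach (keep the last two labels of the host) is an assumption. It gets blog.example.com down to example.com, but it would be wrong for hosts like example.co.uk or for raw IP addresses.

```csharp
using System;

public static class DomainSketch
{
    // Hypothetical helper, not HtmlFragmentHelper's implementation.
    // Returns the last two dot-separated labels of the URL's host,
    // or an empty string when the URL can't be parsed as absolute.
    public static string SecondAndTopLevel(string url)
    {
        if (!Uri.TryCreate(url, UriKind.Absolute, out Uri uri))
        {
            return string.Empty;
        }

        string[] labels = uri.Host.Split('.');
        if (labels.Length < 2)
        {
            return uri.Host;  // Single-label host such as "localhost".
        }

        return labels[labels.Length - 2] + "." + labels[labels.Length - 1];
    }
}
```

Handling multi-part suffixes like co.uk properly would need a public suffix list, which is more than this sketch attempts.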

Read how to use it here. Get it from GitHub, though the only file you really need is here.

And here's a quick article on how to get code back into the clipboard as html, which I'm not doing.


HTML Fragment Format

Want to paste clipboard HTML as HTML? Enter Microsoft's HTML clipboard format. Instead of using the standard code to read the clipboard's contents, like this...

string clipboardText = await Clipboard.GetContent().GetTextAsync();

... now you use this...

string htmlClipboardText = await Clipboard.GetContent().GetHtmlFormatAsync();

But when you look at the contents of htmlClipboardText, you quickly notice that strange things are afoot at the Circle K.

[Image: What number are you thinking of?]

What comes back isn't a string with html source, but an "HTML Fragment" formatted string that looks like this:

HTML Fragment from Edge

Version:1.0
StartHTML:000000210
EndHTML:000003550
StartFragment:000002696
EndFragment:000003500
StartSelection:000002696
EndSelection:000003496
SourceURL:http://daringfireball.net/2016/10/mossberg_siri
<!DOCTYPE HTML>
<HTML lang="en"><HEAD>       <!-- Open Graph [jive] --> <!-- 
    <meta property="og:site_name"   content="Daring Fireball" />
    <meta property="og:title"       content="Walt Mossberg: ‘Why Does Siri Seem So Dumb?’" />
    <meta property="og:url"         content="http://daringfireball.net/2016/10/mossberg_siri" />
    <meta property="og:description" content="In addition to the engineering hurdles to actually make Siri much better, Apple also has to overcome a “boy who cried wolf” credibility problem." />
    <meta property="og:image"       content="https://daringfireball.net/graphics/df-square-192" />
    <meta property="og:type"        content="article" />
 -->         <!-- Twitter Card [jive] -->                    <TITLE>Daring Fireball: Walt Mossberg: ‘Why Does Siri Seem So Dumb?’</TITLE>        <LINK href="/graphics/apple-touch-icon.png" rel="apple-touch-icon-precomposed">     <LINK href="/graphics/favicon.ico?v=005" rel="shortcut icon">   <LINK href="/graphics/dfstar.svg" rel="mask-icon" color="#4a525a">  <LINK href="/css/fireball_screen.css?v1.7" rel="stylesheet" type="text/css" media="screen">     <LINK href="/css/ie_sucks.php" rel="stylesheet" type="text/css" media="screen">     <LINK href="/css/fireball_print.css?v01" rel="stylesheet" type="text/css" media="print">    <LINK href="/feeds/main" rel="alternate" type="application/atom+xml">   
<SCRIPT src="/mint/?js" type="text/javascript" async=""></SCRIPT>

<SCRIPT src="http://www.google-analytics.com/ga.js" type="text/javascript" async=""></SCRIPT>

<SCRIPT src="/js/js-global/FancyZoom.js" type="text/javascript"></SCRIPT>

<SCRIPT src="/js/js-global/FancyZoomHTML.js" type="text/javascript"></SCRIPT>
     <LINK title="Home" href="/" rel="home">     <LINK href="http://df4.us/pfz" rel="shorturl">  <LINK title="Apple Responds to Dash Controversy" href="http://daringfireball.net/2016/10/apple_dash_controversy" rel="prev">    
<SCRIPT src="http://daringfireball.net/mint/?record&amp;key=383950464d37374b39333637695970466458724e6779513431&amp;referer=&amp;resource=http%3A//daringfireball.net/2016/10/mossberg_siri&amp;resource_title=Daring%20Fireball%3A%20Walt%20Mossberg%3A%20%u2018Why%20Does%20Siri%20Seem%20So%20Dumb%3F%u2019&amp;resource_title_encoded=0&amp;window_width=1756&amp;window_height=921&amp;resolution=2438x1371&amp;flash_version=0&amp;1476397798179&amp;serve_js" type="text/javascript"></SCRIPT>
</HEAD><BODY onload="setupZoom()"><DIV id="Box"><DIV id="Main"><DIV class="article"><!--StartFragment--><P>Mossberg:</P><BLOCKQUOTE><P>For instance, when I asked Siri on my Mac how long it would take me to get to work, it said it didn’t have my work address — even though the “me” contact card contains a work address and the same synced contact card on my iPhone allowed Siri to give me an answer.</P><P>Similarly, on my iPad, when I asked what my next appointment was, it said “Sorry, Walt, something’s wrong” — repeatedly, with slightly different wording, in multiple places on multiple days. But, using the same Apple calendar and data, Siri answered correctly on the iPhone.</P></BLOCKQUOTE><P>These sort of glaring inconsistencies are almost as bad as universal failures. The big problem Apple faces with Siri is that when people encounter these problems, <EM>they stop trying</EM>.</P><!--EndFragment--></DIV></DIV></DIV></BODY></HTML>

Wow. No, really, wow. That surprised me. I was using Edge this time instead of Chrome, and whoa. Edge includes the entire page's header. That said, that's really not a horrible idea. We have really good context for this fragment, and if we wanted the original CSS, for example, we know where to get it.

Aside: Look at the info DaringFireball's sucking in:

resolution=2560x1440&amp;
flash_version=0&amp;
1476452395012&amp;
serve_js

Gruber doesn't use Flash. Kinda Panopticlicky, ain't it? Looks like it's supporting this.

Another Aside: Internet Explorer 11 creates the same fragment source, so if you think Edge was created from the ground up...

Interestingly, the old Internet Explorer copyright box used to reference Mosaic, but that's gone in IE11. Wonder if IE11 is really the child of IE6 and 7? But enough aside-ing...

HTML Fragment from Chrome

But let's take a look at where I initially started, with Chrome.

Version:0.9
StartHTML:0000000164
EndHTML:0000002719
StartFragment:0000000200
EndFragment:0000002683
SourceURL:http://daringfireball.net/2016/10/mossberg_siri
<html>
<body>
<!--StartFragment--><p style="margin: 0px 0px 1.6em; padding: 0px; color: rgb(238, 238, 238); font-family: Verdana, &quot;Bitstream Vera Sans&quot;, sans-serif; font-size: 11px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(74, 82, 90);">Mossberg:</p><blockquote style="font-size: 11px; margin: 2em 2em 2em 1em; padding: 0px 0.75em 0px 1.25em; border-left: 1px solid rgb(119, 119, 119); border-right: 0px solid rgb(119, 119, 119); outline: 0px; vertical-align: baseline; background: rgb(74, 82, 90); color: rgb(238, 238, 238); font-family: Verdana, &quot;Bitstream Vera Sans&quot;, sans-serif; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px;"><p style="margin: 0px 0px 1.6em; padding: 0px;">For instance, when I asked Siri on my Mac how long it would take me to get to work, it said it didn’t have my work address — even though the “me” contact card contains a work address and the same synced contact card on my iPhone allowed Siri to give me an answer.</p><p style="margin: 0px 0px 1.6em; padding: 0px;">Similarly, on my iPad, when I asked what my next appointment was, it said “Sorry, Walt, something’s wrong” — repeatedly, with slightly different wording, in multiple places on multiple days. But, using the same Apple calendar and data, Siri answered correctly on the iPhone.</p></blockquote><p style="margin: 0px 0px 1.6em; padding: 0px; color: rgb(238, 238, 238); font-family: Verdana, &quot;Bitstream Vera Sans&quot;, sans-serif; font-size: 11px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(74, 82, 90);">These sort of glaring inconsistencies are almost as bad as universal failures. The big problem Apple faces with Siri is that when people encounter these problems,<span class="Apple-converted-space"> </span><em>they stop trying</em>.</p><!--EndFragment-->
</body>
</html>

See the difference? Chrome's fragment initially looks much more focused, but also provides much less information about the code's original context. Worse, Chrome puts all the CSS information inline, and repeating each CSS property inside each pasted html element leaves the source a long way from DRY. And that out-of-context, inline CSS is lossily translated, which will cause us some rendering problems that we'll see and discuss in a bit.

Original Source

Now Edge's fragment isn't perfect. Here's the original html of that snippet straight from the server...

<p>Mossberg:</p>

<blockquote>
  <p>For instance, when I asked Siri on my Mac how long it would take
me to get to work, it said it didn’t have my work address &#8212;
even though the “me” contact card contains a work address and
the same synced contact card on my iPhone allowed Siri to give
me an answer.</p>

<p>Similarly, on my iPad, when I asked what my next appointment was,
it said “Sorry, Walt, something’s wrong” &#8212; repeatedly, with
slightly different wording, in multiple places on multiple days.
But, using the same Apple calendar and data, Siri answered
correctly on the iPhone.</p>
</blockquote>

<p>These sort of glaring inconsistencies are almost as bad as universal failures. The big problem Apple faces with Siri is that when people encounter these problems, <em>they stop trying</em>.

NOTE: There's no closing </p> tag because I've cut the source at that spot.

Look how clean the original is! Why can't we just get that? Chrome's CSS injections and line flattening make for a much less human-readable snippet.

Edge is close, but still weird. Take another look. For some reason, Edge's fragment makes every tag uppercase like it's 1998. It also trashes Gruber's original (and thoughtful) whitespace. Also yuck. I mean, Gruber's using Markdown to write Daring Fireball. In a perfect world, I'd take that clipboard's html and turn it back into Markdown when it's pasted into MarkUpDown.

How do we get this original source?

Start and End Fragment

The important thing to catch is that the HTML Fragment format almost always includes a full html doc with html and body tags after the SourceURL and other metadata. Within that html code, there are markers, <!--StartFragment--> and <!--EndFragment-->, saying where the exact selection started and stopped.
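
To make that layout concrete, here's a rough sketch of reading the metadata lines that sit above the html. It's not HtmlFragmentHelper's code: FragmentHeaderSketch and ParseHeader are hypothetical names, and it assumes every header line is Key:Value and that the html starts at the first line beginning with <.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FragmentHeaderSketch
{
    // Hypothetical helper: collects the "Key:Value" lines at the top of an
    // HTML Fragment string, stopping at the first line that starts with '<'.
    public static Dictionary<string, string> ParseHeader(string rawFragmentSource)
    {
        var header = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        using (var reader = new StringReader(rawFragmentSource ?? string.Empty))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.StartsWith("<", StringComparison.Ordinal))
                {
                    break;  // The html document begins here.
                }

                // Split on the first colon only; SourceURL values contain colons too.
                int colon = line.IndexOf(':');
                if (colon > 0)
                {
                    header[line.Substring(0, colon).Trim()] = line.Substring(colon + 1).Trim();
                }
            }
        }

        return header;
    }
}
```

The values come back as strings, zero padding and all, so a caller would still run something like int.Parse over StartHTML and friends before using them as numbers.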

Chrome dives into your selected text (the "true" html fragment) right after the <body> tag, but Edge also gives you any open tags that come before your fragment. In this case, three div tags were still open when you got to the highlighted selection: <DIV id="Box"><DIV id="Main"><DIV class="article">. Knowing this slice of the DOM could be really helpful if you wanted to retrieve the original CSS and format your fragment as close to the original markup as possible.

If you ignore context like this, though, it's an easy enough feat to take the string and parse out the "true fragment's" html by splitting the entire HTML Fragment string on <!--StartFragment--> and <!--EndFragment-->. Here's an early version of something I wrote to just take out the html, ignoring all the fragment metadata.

// Needs: using System; (for StringComparison)
private string _parseHtmlClipboardFragment(string rawFragmentSource)
{
    const string delimiterStartAfter = "<!--StartFragment-->";
    const string delimiterEndBefore = "<!--EndFragment-->";

    // No start marker: hand back the input untouched.
    int start = rawFragmentSource.IndexOf(delimiterStartAfter, StringComparison.Ordinal);
    if (start < 0)
    {
        return rawFragmentSource;
    }

    // Keep only what follows the start marker.
    string ret = rawFragmentSource.Substring(start + delimiterStartAfter.Length);

    // No luck, Ending not after Start; go back to nothing.
    int end = ret.IndexOf(delimiterEndBefore, StringComparison.Ordinal);
    return end < 0 ? string.Empty : ret.Substring(0, end);
}

NOTE: I've got a nice library that parses all this up into a custom html fragment view model that I mentioned in TL;DR. I wouldn't and don't use the above code in most cases.

When we parse out that html and insert it into our Markdown from Chrome, however, we get some, um, interesting results.

NOTE: Remember that Markdown is a superset, so to speak, of html. You could have a "Markdown" file without any Markdown syntax that's pure html. So what I'm really saying is, "Let's see what those html snippets that we pulled from the fragment look like when they're injected into our regularly scheduled html."

I'm going to wrap these pastes in blockquote tags so that you can tell where they start and stop easily.

Edge Parsed Code

Mossberg:

For instance, when I asked Siri on my Mac how long it would take me to get to work, it said it didn’t have my work address — even though the “me” contact card contains a work address and the same synced contact card on my iPhone allowed Siri to give me an answer.

Similarly, on my iPad, when I asked what my next appointment was, it said “Sorry, Walt, something’s wrong” — repeatedly, with slightly different wording, in multiple places on multiple days. But, using the same Apple calendar and data, Siri answered correctly on the iPhone.

These sort of glaring inconsistencies are almost as bad as universal failures. The big problem Apple faces with Siri is that when people encounter these problems, they stop trying.

Though you should remember that the original source has been smeared together, losing its newlines, and all the tags are strangely capitalized, the source looks pretty nice. It's plain, and adopts the local styling for the most part, but the most important markup is still there -- the blockquote, the italics, links if they'd been included, etc.

Chrome Parsed Code

Mossberg:

For instance, when I asked Siri on my Mac how long it would take me to get to work, it said it didn’t have my work address — even though the “me” contact card contains a work address and the same synced contact card on my iPhone allowed Siri to give me an answer.

Similarly, on my iPad, when I asked what my next appointment was, it said “Sorry, Walt, something’s wrong” — repeatedly, with slightly different wording, in multiple places on multiple days. But, using the same Apple calendar and data, Siri answered correctly on the iPhone.

These sort of glaring inconsistencies are almost as bad as universal failures. The big problem Apple faces with Siri is that when people encounter these problems, they stop trying.

Um, ew. Wha' happen'd?

Unfortunately, even though the same styles apply to Daring Fireball's entire page, this inline kludge from Chrome attaches them only to each block-level tag, in this case each <p> and the <blockquote>. So when we have margins around paragraphs, there's nothing telling us to also put that color and style between the block-level tags. There's no surrounding tag to be a stand-in for, essentially, the body (or any other lost, open) tag.

Let's look at all the overhead we have for each p from that fragment:

<p style="
    margin: 0px 0px 1.6em;
    padding: 0px;
    color: rgb(238, 238, 238);
    font-family: Verdana, &quot;Bitstream Vera Sans&quot;, sans-serif;
    font-size: 11px;
    font-style: normal;
    font-variant-ligatures: normal;
    font-variant-caps: normal;
    font-weight: normal;
    letter-spacing: normal;
    orphans: 2;
    text-align: left;
    text-indent: 0px;
    text-transform: none;
    white-space: normal;
    widows: 2;
    word-spacing: 0px;
    -webkit-text-stroke-width: 0px;
    background-color: rgb(74, 82, 90);
">

And here's what we have from Chrome's Dev Tools' "Computed" style tab:

color: rgb(238, 238, 238);
display: block;
font-family: Verdana, "Bitstream Vera Sans", sans-serif;
font-size: 11px;
height: 19px;
line-height: 19.8px;
margin-bottom: 17.6px;
margin-left: 0px;
margin-right: 0px;
margin-top: 0px;
padding-bottom: 0px;
padding-left: 0px;
padding-right: 0px;
padding-top: 0px;
text-align: left;
text-size-adjust: 100%;
width: 425px;
-webkit-margin-after: 17.6px;
-webkit-margin-before: 0px;
-webkit-margin-end: 0px;
-webkit-margin-start: 0px;

But note that that's essentially all from one css file, as properties from user agent stylesheets are simply the browser's defaults:

[Image: Paragraph styles for Daring Fireball]

If you'd like, we can compare the fragment's CSS vs. the stylesheet. Spoiler: It's a mess. Not a lot of matching. Click here to see. Ugly.

The bottom line is that background-color: rgb(74, 82, 90); is something that belongs to the body, but is inserted here by Chrome into its HTML Clipboard contents as something that's attached to each block element. That's wrong. It doesn't display correctly, and it needlessly clutters the resulting HTML.

With HtmlFragmentHelper, I have a couple of choices. One is to strip all inline styles out, or at least some subset of styling. If I want to maintain as much of the original as possible, I might just blast anything that injects colors into the html, like color and background-color. Or I could blast all the inline style info and get something more like Edge's code.
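
As a rough sketch of those two stripping options (an assumption about how it could be done, not what HtmlFragmentHelper does today), regular expressions get you surprisingly far on browser-generated fragments. InlineStyleSketch and its method names are hypothetical, and a regex is not an html parser, so treat this as good enough for clipboard pastes rather than arbitrary markup.

```csharp
using System.Text.RegularExpressions;

public static class InlineStyleSketch
{
    // A whole style="..." or style='...' attribute, including its leading whitespace.
    private static readonly Regex StyleAttribute = new Regex(
        "\\s+style\\s*=\\s*(\"[^\"]*\"|'[^']*')",
        RegexOptions.IgnoreCase);

    // A color or background-color declaration. The lookbehind keeps longer
    // property names that merely end in "-color" (border-color, say) intact.
    private static readonly Regex ColorDeclaration = new Regex(
        @"(?<![\w-])(?:background-color|color)\s*:[^;""']*;?\s*",
        RegexOptions.IgnoreCase);

    // Option 1: blast every inline style attribute.
    public static string StripAllInlineStyles(string html)
    {
        return StyleAttribute.Replace(html, string.Empty);
    }

    // Option 2: keep the attribute, but drop the color declarations inside it.
    public static string StripColors(string html)
    {
        return StyleAttribute.Replace(
            html,
            match => ColorDeclaration.Replace(match.Value, string.Empty));
    }
}
```

StripColors only looks at color and background-color, so the background: shorthand Chrome put on the blockquote above would survive it, and a tag whose only declarations were colors ends up with an empty style attribute.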

Or I could normalize the CSS and optionally (via a property on HtmlFragmentViewModel) wrap it all in a block-level tag that contains CSS that's everywhere. Or at least in all of the top-level tags. But now I'm starting to write an html parser, which is a little outside of my current scope.

I think I'm going to do the former, though it's not in there as of this writing. That Chrome paste looks horrible as is, though. If only there was an easier way...


HTML Fragment from Firefox

There are certainly other applications that create HTML Clipboard values. I ran into LibreOffice's when using Calc, their spreadsheet app, and its html clipboard doesn't include Start and EndFragment tags. Wonderful, folks, wonderful.
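
When the comment markers are missing, the StartFragment and EndFragment numbers in the header might still save you, assuming the app wrote them. Here's a sketch that slices by those numbers instead. It assumes the offsets count UTF-8 bytes from the start of the whole string, which is how Microsoft's description of the format defines them, and that the string you were handed still lines up with those offsets. FragmentOffsetSketch and SliceByByteOffsets are hypothetical names.

```csharp
using System;
using System.Text;

public static class FragmentOffsetSketch
{
    // Hypothetical helper: startFragment and endFragment are the header's
    // StartFragment and EndFragment values, treated as UTF-8 byte offsets
    // from the start of the entire HTML Fragment string.
    public static string SliceByByteOffsets(string rawFragmentSource, int startFragment, int endFragment)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(rawFragmentSource ?? string.Empty);

        if (startFragment < 0 || endFragment < startFragment || endFragment > bytes.Length)
        {
            return string.Empty;  // Offsets don't fit this string; give up.
        }

        return Encoding.UTF8.GetString(bytes, startFragment, endFragment - startFragment);
    }
}
```

Working in bytes rather than characters matters as soon as the page has curly quotes or other non-ASCII text before or inside the selection, since each of those takes more than one byte in UTF-8.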

But isn't there one glaring omission I might want to add here? You know, I think there is at least one...

For fun, let's take a look at Firefox's fragment too. Firefox isn't as popular as it used to be, but it's probably worth taking a look. (Remember that this is all on Windows, and Safari is dead enough by now that I'm not going to bother with v5.)

Version:0.9
StartHTML:00000156
EndHTML:00001060
StartFragment:00000190
EndFragment:00001024
SourceURL:http://daringfireball.net/2016/10/mossberg_siri
<html><body>
<!--StartFragment--><p>Mossberg:</p>

<blockquote>
  <p>For instance, when I asked Siri on my Mac how long it would take
me to get to work, it said it didn’t have my work address —
even though the “me” contact card contains a work address and
the same synced contact card on my iPhone allowed Siri to give
me an answer.</p>

<p>Similarly, on my iPad, when I asked what my next appointment was,
it said “Sorry, Walt, something’s wrong” — repeatedly, with
slightly different wording, in multiple places on multiple days.
But, using the same Apple calendar and data, Siri answered
correctly on the iPhone.</p>
</blockquote>

<p>These sort of glaring inconsistencies are almost as bad as universal 
failures. The big problem Apple faces with Siri is that when people 
encounter these problems, <em>they stop trying</em>.</p><!--EndFragment-->
</body>
</html>

Oh, so beautiful. Same whitespace as the original, with the added bonus (?) that the final paragraph is also text wrapped. No crazy inline CSS attempted. No header overhead or 1998-style HTML TAGS.

You could argue that Chrome is better because it has that inline CSS (yuck!) or Edge because it has the full page headers, but honestly, we're only one step farther away from what Edge gave us. We have the URL here. If we have network access to get CSS, we have network access to read the header and figure out where the CSS lives ourselves. Don't overcomplicate things. Beautiful. Good looking code continues to look good when pasted inline with Markdown.

Firefox Parsed Code

And Firefox's snippet is just as nice when pasted.

Mossberg:

For instance, when I asked Siri on my Mac how long it would take me to get to work, it said it didn’t have my work address — even though the “me” contact card contains a work address and the same synced contact card on my iPhone allowed Siri to give me an answer.

Similarly, on my iPad, when I asked what my next appointment was, it said “Sorry, Walt, something’s wrong” — repeatedly, with slightly different wording, in multiple places on multiple days. But, using the same Apple calendar and data, Siri answered correctly on the iPhone.

These sort of glaring inconsistencies are almost as bad as universal failures. The big problem Apple faces with Siri is that when people encounter these problems, they stop trying.

I miss Firefox.
