This article was originally published in InDesign Magazine issue 55 (August–September 2013).
The future of publishing may rest upon HTML. Whether or not that’s true, only time will tell. But there’s no denying that a vast amount of content has been structured and formatted in HTML. Typically, our challenge as designers is getting content out of InDesign as HTML, a task which I covered in my article “InDesign to HTML” in the April/May 2013 issue of InDesign Magazine. But there may also be times when we’re called upon to do the opposite—to take HTML content and bring it into the realm of print or PDF through InDesign. Currently, there’s no method for directly importing HTML into InDesign, but there are a few “unofficial” paths for bringing HTML in and preserving much of its structure and formatting.
Whether your preference for tackling this task is down and dirty, methodical and geeky, or somewhere in between, at least one of the approaches in this article should help you get your HTML into InDesign—and, with one exception, won’t require spending a dime, except for the cost of your own time. None of them are perfect, but they beat starting from scratch.
Bear in mind, too, that my goal here is not to take a full web page layout and recreate it in InDesign. These methods bring in only the text content from the HTML.
As a markup language, HTML identifies different types of content, like headings, paragraphs, and lists. The formatting of that content is described by code called a Cascading Style Sheet (CSS). The content and its formatting instructions are brought together and rendered by the web browser. Your challenge is to bridge the gap between the code and InDesign using one of the following four methods.
Method #1: Copy, Paste, and Hope for the Best
• No prep up front
• Type formatting (somewhat) preserved
• No styles generated for pasted content
• Bolds and italics are lost
This option requires the least amount of up-front effort but comes with a huge “your mileage may vary” disclaimer.
Open an HTML file or web page using your web browser, select all of the desired copy in the browser, and copy it to the clipboard. Browsers will differ in how much, if any, formatting they preserve. For example, Safari on the Mac will retain much of the formatting you’ve copied, while Chrome preserves nothing. On the Windows side, the current version of the much-maligned Internet Explorer does probably the best job of all.
Before pasting what you’ve copied into InDesign, however, be sure you’ve set your clipboard handling preference (Preferences > Clipboard Handling) for “When pasting text and tables from other applications” to “All information (Index Markers, Swatches, Styles, etc.)” (see Figure 1). Otherwise, the content will come in as unformatted text.
Figure 1
When you paste the copied text into InDesign, you’ll see some measure of the formatting preserved, depending on the browser you copied it from (Figure 2).
Figure 2
In most cases, all text will come in with “No Paragraph Style” as its style, with all other formatting treated as overrides, and no character styles will be created (Figure 3).
Figure 3
However, Internet Explorer for Windows manages to hang on to heading attributes (H1, H2, etc.) and hyperlinks well enough that corresponding Paragraph Styles for H1, H2, and so on will be created, along with a Hyperlink Character Style.
From this point on, building the styles you want is up to you, but you’ll have the formatted text as a visual reference. To speed things up, you could download and install Thomas Silkjaer’s Auto Create Paragraph and Character Styles script. The script examines the formatting in the document and then creates and applies paragraph and character styles to the text. It does quite a nice job of keeping the number of styles to a minimum, too. The styles are generically named—AutoStyle1, AutoStyle2, and so on (Figure 4)—but renaming styles is a lot less work than creating them manually.
Figure 4
Method #2: By Way of Word
• Free (unless you don’t own Word)
• Translates most tags and classes into styles
• Word can be very fussy about opening certain URLs
You can use Microsoft Word to act as a bridge between HTML and InDesign by using Word’s Open URL feature (File > Open URL). Simply enter a URL into the field, and Word captures the content from the destination page. One thing to note is that Word doesn’t resolve the “friendly URL” scheme (i.e., /name-of-article-all-spelled-out/) used by most sites today. It considers those URLs to be directory names, and Word needs a filename—preferably, one ending in “htm” or “html.” To get around this limitation, go to the web page and use your browser’s Save As command to save the content as “source”—meaning the actual HTML markup and content—and then open that saved “.html” file with Word.
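If you would rather script the Save As step than do it by hand in the browser, something along these lines might work. This is only a sketch using Python's standard library; the function names html_filename_for and save_page_source are made up for illustration, and the download step assumes the page can be fetched without logging in.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def html_filename_for(url: str) -> str:
    """Turn a 'friendly' URL into a filename that ends in .htm or .html."""
    path = urlparse(url).path.strip("/")
    # Use the last path segment as the base name; fall back to "index".
    base = path.split("/")[-1] if path else "index"
    if base.lower().endswith((".htm", ".html")):
        return base
    return base + ".html"


def save_page_source(url: str, folder: str = ".") -> Path:
    """Download the raw markup at `url` and save it as a local .html file."""
    target = Path(folder) / html_filename_for(url)
    with urlopen(url) as response:  # requires network access
        target.write_bytes(response.read())
    return target
```

The saved file has the kind of name Word expects, so it can be opened directly from Word's Open dialog.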
Don’t expect Word to get the formatting right, however—the page won’t look like it does on the web when its content arrives in a Word document (Figure 5).
Figure 5
That said, Word does an excellent job translating most HTML tags, such as headings and lists, into paragraph styles (Figure 6).
Figure 6
In addition, paragraphs in the HTML with class attributes in the <p> tag result in Word paragraph styles named to match those classes. For example, text enclosed with a tag like <p class="author_name"> will produce a Word style of “author_name.” Generic paragraphs get the basic “Normal” (or, sometimes, “Normal (Web)”) style applied. Bolds, italics, and hyperlinks automatically produce “Strong,” “Emphasis,” and “Hyperlink” character styles, respectively.
However, this method might bring more of the web content into Word than you want. Site navigation options, sidebars, and page footers will be in the document if you’ve saved the entire web page. In that case, simply select and delete those elements from the Word file, and then choose File > Save As. Give the file a name, and save it in either Word format (.docx) or Rich Text Format (.rtf) by choosing the appropriate option from the Format menu in Word’s Save dialog box.
When you import the Word or RTF file into InDesign, the styles will come along with it (Figure 7).
Figure 7
Be prepared, however, to find overrides in abundance and to spend time cleaning up and redefining your styles. Word does an okay job, but it will get a lot wrong, especially if the incoming HTML isn’t clean, logically ordered, and standards-compliant. Also, don’t expect to get a perfect representation of HTML content in InDesign. The best you can expect to achieve is greatly reducing the amount of manual re-formatting required, along with maintaining the content hierarchy, most style designations, and your bolds and italics.
Method #3: Get Hard-core with Hard Code
• Offers the most control; leverages existing markup
• Might be too geeky for some users; requires multiple steps; tables get very tricky
The method that gives you the most control over your HTML-to-InDesign conversion also requires a more hands-on approach to working with the source HTML. In this case, you’ll employ the power of GREP Find/Change to simultaneously use the HTML markup to apply styles and strip out that very same markup.
The best thing you can leverage from HTML is its structure and consistency (unless you’re working from very sloppy HTML that doesn’t comply with web standards, which we’ll assume is not the case here). Anything consistent and structured can be dealt with very effectively with some relatively simple GREP Find/Change operations.
First, you need to get to the source HTML. From most browsers, this is as simple as choosing File > Save As and choosing Page Source (or Source, or HTML, or other words to that effect). Once saved, the HTML file can be opened by any text editor (for example TextWrangler on the Mac, or Notepad on Windows). From there, you can select all of it (or just the portion you want), copy it to the clipboard, and paste it into InDesign.
Before moving on, take a look at the HTML to see how many different types of markup tags you’ll need to deal with (and therefore how many styles you’ll need to create). For example, some markup is highly semantic, meaning that it’s simple and relatively free of class attributes. Some HTML will rely quite a bit on classes. That’s tricky, because you’ll need to create separate paragraph styles for each tag (H1, H2, P, LI, etc.), with and without classes, and a character style to correspond with each bold and italic tag (typically <strong> and <em>, or <b> and <i>), plus any <span> tags. Initially, those styles don’t even need to be defined with anything other than a name. Attributes can be added later, but the styles themselves need to be present in the document before you can start practicing your GREP magic.
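If the markup is long, counting tags by eye gets tedious. As a rough aid, a short script can inventory the tag and class combinations for you. The sketch below uses Python's built-in html.parser; the TagInventory class, the inventory function, and the two tag sets are my own assumptions about which tags matter, so adjust them to suit your HTML.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags assumed to map to paragraph styles and character styles, respectively.
PARAGRAPH_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "blockquote"}
CHARACTER_TAGS = {"strong", "b", "em", "i", "span", "a"}


class TagInventory(HTMLParser):
    """Count each tag, and each tag.class combination, found in the markup."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in PARAGRAPH_TAGS | CHARACTER_TAGS:
            return
        css_class = dict(attrs).get("class")
        name = f"{tag}.{css_class}" if css_class else tag
        self.counts[name] += 1


def inventory(html: str) -> Counter:
    parser = TagInventory()
    parser.feed(html)
    return parser.counts


sample = (
    "<h1>Title</h1>"
    '<p class="author_name">Jane Doe</p>'
    "<p>First <strong>bold</strong> paragraph.</p>"
    "<p>Second paragraph.</p>"
)
for name, count in sorted(inventory(sample).items()):
    print(name, count)
```

Each name the script prints is a candidate for one paragraph or character style in your InDesign document.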
Once you’ve created the necessary styles, you can open the Find/Change dialog box, choose the GREP tab, and start searching for specific tags and the text that appears within them. For example, the regular expression <h1>(.+?)</h1> will find the level 1 heading, including its tags, and <p>(.+?)</p> will find all of the text in any single paragraph along with its surrounding tags. A search for <strong>(.+?)</strong> will find all text tagged to appear bold (use <b>(.+?)</b> if the markup uses <b> tags instead). These are just a few of several searches (Figure 8) you’ll need to run to find the different content and its tags.
Figure 8
So what do you do with the text once you’ve found it? First, make sure your Find/Change dialog box is showing all of its options by clicking the More Options button (if you see a Fewer Options button, you’re already seeing everything). Next, enter the appropriate expression in the Find What field of the GREP area of the Find/Change dialog box. If you’re using expressions like the examples above, the (.+?) portion of the expression refers to the text you want to keep. Literally, it means “any one or more characters, but the shortest match.”
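To see why the “shortest match” part matters, it can help to try the expression outside InDesign. The sketch below uses Python's re module as a stand-in for InDesign's GREP engine; the two engines are not identical, but they treat these particular constructs the same way. The sample line is invented.

```python
import re

line = "<p>First paragraph.</p><p>Second paragraph.</p>"

# Shortest match: (.+?) stops at the first closing tag it can reach.
lazy = re.findall(r"<p>(.+?)</p>", line)

# Longest match: (.+) runs on to the last closing tag on the line.
greedy = re.findall(r"<p>(.+)</p>", line)

print(lazy)    # ['First paragraph.', 'Second paragraph.']
print(greedy)  # ['First paragraph.</p><p>Second paragraph.']
```

Without the question mark, two paragraphs that share a line are swallowed as one match, tags and all.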
In the Change To field, just type $1, which will put back the text found within the HTML tags, but will discard the tags themselves. (Technically, it “puts back” the part that was inside the parentheses.) Then, click the small icon to the right of the Change Format area at the bottom of the Find/Change dialog box and, in the resulting window, select the paragraph style you want to apply to the found text. For example, if you’re searching for text within the <h1> and </h1> tags, you’d choose your level 1 heading style (whatever you’ve named it). When you click Find, then Change (and I strongly suggest testing this by changing one or two before committing to Change All), InDesign will put back the text within the opening and closing tags, apply the style you’ve specified, and delete the tags, leaving you with styled text without any surrounding tags.
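The find, replace, and apply-style cycle can be rehearsed outside InDesign before you commit to Change All. The following is a minimal sketch in Python rather than anything InDesign runs. The style names “Heading 1” and “Body” and the apply_rules function are hypothetical, and the backreference \1 plays the role that $1 plays in the Change To field.

```python
import re

# Hypothetical pairing of Find What expressions with paragraph style names.
PARAGRAPH_RULES = [
    (r"<h1>(.+?)</h1>", "Heading 1"),
    (r"<p>(.+?)</p>", "Body"),
]


def apply_rules(lines):
    """Return (style, text) pairs with the matched tags stripped out."""
    styled = []
    for line in lines:
        for pattern, style in PARAGRAPH_RULES:
            if re.search(pattern, line):
                # Keep only the captured text, as $1 does in Change To.
                styled.append((style, re.sub(pattern, r"\1", line)))
                break
        else:
            # No rule matched: leave the line untouched and unstyled.
            styled.append((None, line))
    return styled


sample_lines = [
    "<h1>Importing HTML</h1>",
    "<p>Some <strong>bold</strong> text.</p>",
    "<ul>",
]
for style, text in apply_rules(sample_lines):
    print(style, "|", text)
```

Lines that come back with no style are the ones that still need a rule, which is a quick way to spot tags you have not yet accounted for.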
Dealing with hyperlinks
In this method of processing pasted HTML, removing the markup for hyperlinks while preserving the link information requires a three-part process. First, choose Convert URLs to Hyperlinks from the Hyperlinks panel menu (or choose Type > Hyperlinks & Cross-references > Convert URLs to Hyperlinks), and then click Convert All in the resulting dialog box. InDesign will add all of the links it detects as Shared Hyperlink Destinations and will automatically apply them to the URLs within the link anchor tag. However, the text between the markup (the <a> and </a> tags) won’t have the newly created links applied. That task falls on you.
The fastest way to find all the text within the anchor tags is to run a GREP-based Find/Change for any text preceded by a closing angle bracket and followed by a closing anchor tag. That expression, (?<=>).+?(?=</a>), will select the content of the link tag but not the tags around it. For each search result selected, choose the desired hyperlink from the pull-down menu in the Hyperlinks panel, and then click Find Next in the Find/Change dialog box to continue on to the next result. Repeat this step until you’ve processed all the links.
Finally, once all the desired text has hyperlinks applied, you’ll want to remove the original markup. Here again, it’s easily done using GREP. In the Find/Change dialog box, on the GREP tab, enter an expression that matches any opening or closing anchor tag (for example, <a [^>]*>|</a>) in the Find What field, leave the Change To field empty, and then click Change All. That removes every opening and closing anchor tag, leaving you with working InDesign hyperlinks on the remaining text.
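Both hyperlink searches can be tried on a sample string first. Again, this is a sketch in Python's re module standing in for InDesign's GREP. The sample text and URL are invented, and the tag-removal pattern shown is just one possible way to match opening and closing anchor tags.

```python
import re

source = 'Read <a href="http://example.com/">the full story</a> online.'

# The lookbehind (?<=>) requires a ">" immediately before the match, and the
# lookahead (?=</a>) requires "</a>" immediately after it. Neither one becomes
# part of the selected text.
link_text = re.search(r"(?<=>).+?(?=</a>)", source).group(0)

# One possible cleanup pattern: any opening <a ...> tag or closing </a> tag.
without_tags = re.sub(r"<a [^>]*>|</a>", "", source)

print(link_text)     # the full story
print(without_tags)  # Read the full story online.
```

Note that the lookbehind matches after any closing angle bracket, so if other tags still precede the link in the same paragraph, the selection may start earlier than you expect. Stripping the paragraph-level tags first avoids that.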
Method #4: Seek Professional Help
• More automated than any of the other methods
• Out-of-pocket expense (but relatively low); results can be unpredictable with complex content
Sometimes the best solution is one someone else came up with. No one can know or do everything themselves, and there are a lot of very smart people out there creating solutions that fill in the gaps of InDesign’s feature set. One such solution, Rorohiko’s FramedWeb plug-in for InDesign ($39.00), tackles this very problem.
FramedWeb contains both an HTML parser and a CSS parser, and it allows you to create styled InDesign content from a URL, a local HTML file, or HTML copied and pasted into InDesign. Of those three sources, the first (a live URL) is the least reliable, which isn’t surprising considering that it’s the most ambitious. FramedWeb allows you to type a URL into an InDesign text frame and then choose Convert Web Content from the API menu (which is where Rorohiko plug-ins usually live). The plug-in goes to that URL, parses and gathers its content, and brings it into InDesign, generating character and paragraph styles in the process. The more complex the page, the less successful the plug-in tends to be.
On the other hand, it does extremely well in parsing source HTML (the actual markup) you paste into InDesign, especially if you grab only the desired content (for example, the body of an article without all of the surrounding web page elements) and run the plug-in on that (Figure 9 and Figure 10).
Figure 9
Figure 10
FramedWeb creates paragraph and character styles from the imported HTML in a way that is very web-like. The “cascading” part of CSS refers to the method of controlling the most text at the root level of a style definition and then describing only variations to that root style where needed. All of the paragraph and character styles FramedWeb generates are based on a top-level style called HTML, and styles are arranged hierarchically using style groups to help make the dependencies more apparent. It’s logical from an engineering point of view, but most InDesign users will probably want to rework those styles to fit their preferred organizational schemes. Another caveat is that tables are not currently converted by FramedWeb. To bring in HTML tables as InDesign tables, you’ll have more success with Method #1 or #2.
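The idea of a root style with child styles that record only their differences can be pictured with a small model. This sketch is not FramedWeb's implementation. The style names, attributes, and the resolve function are all invented to illustrate how a “based on” chain yields a style's effective formatting.

```python
# Hypothetical style sheet: each style names its parent and lists only the
# attributes that differ from that parent.
STYLES = {
    "HTML": {"based_on": None, "attrs": {"font": "Minion Pro", "size": 10, "weight": "regular"}},
    "p": {"based_on": "HTML", "attrs": {}},
    "h1": {"based_on": "HTML", "attrs": {"size": 24, "weight": "bold"}},
    "p.author_name": {"based_on": "p", "attrs": {"font_style": "italic"}},
}


def resolve(style_name: str) -> dict:
    """Walk up the based-on chain and merge attributes; child values win."""
    chain = []
    name = style_name
    while name is not None:
        chain.append(STYLES[name])
        name = STYLES[name]["based_on"]
    resolved = {}
    for style in reversed(chain):  # root first, so children override it
        resolved.update(style["attrs"])
    return resolved


print(resolve("h1"))
print(resolve("p.author_name"))
```

Changing the font in the root style changes it everywhere, which is the appeal of the arrangement and also the reason reworking such a hierarchy takes some care.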
FramedWeb does not claim to recreate a web layout or preserve its appearance. In fact, its documentation quite explicitly acknowledges that it doesn’t. There’s nothing out there that will, but FramedWeb is currently the most automated method of handling this kind of conversion. Rorohiko makes the fully functioning plug-in available free for a 30-day trial, giving you ample time to try it out on the kinds of content you may need to deal with and evaluate how well it matches your needs.
All Words and No Pictures?
Getting HTML content into InDesign with its hierarchy and as much formatting as possible preserved is a lot trickier than getting images from a web page, and only Method #4 in this article allows for bringing in both at the same time. The problem with web images is that they are optimized for the screen, typically at a resolution and size that allows for fast downloading and browser rendering. That kind of image won’t cut it for print, however. If you intend to repurpose your HTML content for print, you’re going to need the original images from which those web-optimized versions were produced, wherever possible. But if you’re simply creating an InDesign layout that will never be produced with ink on paper (a PDF, for instance), then maybe those lower-resolution images will suffice. So how do you go about getting them?
Nearly all web browsers share a common feature that lets you right-click any image, then choose an option (Save As, Save Image As, Save to Disk, etc.) for locally saving a copy of that image. That’s fine if you’ve only got a small handful of images to contend with, but if you have more than that, you’ll want to speed up that process. Some browsers have free extensions available that, once installed, will add multi-image saving capabilities. For Firefox, there’s Save Images, and for Chrome, there’s Image Downloader.
Safari has no built-in method for saving all web images at once, but you can quickly cobble this functionality together using OS X’s Automator utility (Figure 11).
Figure 11
From Automator, simply drag three actions (Get Current Webpage from Safari, Get Image URLs from Webpage, and Download URLs, in that order) from the Internet group of the Actions Library into the main workflow area of the application. You can set up actions to get the images on the page itself or that the page links to, and specify a folder where the images will be saved. Next, make sure the web page you want is in the frontmost active window in Safari, and click the Run button in Automator. Once the Action is complete, you’ll find all of the saved images in the folder you specified.
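If you are not on a Mac, or you already have the page source saved, the middle step of that workflow (collecting the image URLs) can be approximated with a few lines of Python. This is a sketch using only the standard library; the ImageCollector class, the image_urls function, and the example.com addresses are placeholders of my own.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect the src of every <img> tag, resolved against the page URL."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            absolute = urljoin(self.page_url, src)
            if absolute not in self.urls:  # skip repeats of the same image
                self.urls.append(absolute)


def image_urls(html: str, page_url: str) -> list:
    collector = ImageCollector(page_url)
    collector.feed(html)
    return collector.urls


page = '<p><img src="/images/a.jpg"><img src="b.png" /><img src="/images/a.jpg"></p>'
for url in image_urls(page, "http://example.com/articles/post/"):
    print(url)
```

Each URL in the resulting list could then be saved to a folder with urllib.request.urlretrieve, which roughly corresponds to the Download URLs action.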
Choosing the Right Method for You
None of the four methods described in this article is a clear “winner” for every HTML importing and conversion scenario, but one of them—or a combination of several of them—can be applied in a manner that makes the best use of its particular strengths. They’re all meant to simply get you closer to your objective. Every project is different, no one’s work habits are the same, and not all HTML is created equal. The “right” choice is the one that gets you the desired result for a given task at a given time.
The key is to remember that the information you need to repurpose—from HTML content into InDesign content—is already there in the markup. The tags identify the type of content and its place in the hierarchy, classes specify unique formatting changes, and your most direct path to success is to leverage that information to save yourself time and preserve the essential structure and formatting of your content.
***
Michael Murphy is an award-winning designer, InDesign expert, author of Adobe InDesign Styles, and co-author of Adobe Creative Suite Design Premium CS5 How-Tos: 100 Essential Techniques. He is also the author of the lynda.com courses Learning Grep with InDesign, InDesign Styles in Depth, and InDesign for Web Design. You can view his work at vinestreetdesign.com.