Running with Characters


When working with shape text through automation, the Text property of a Shape is generally sufficient to get and set the text.  However, this simple property ignores the fact that Visio text may have rich formatting applied or that the text is composed of text fields.  If you need a more industrial-strength way to retrieve and modify shape text, use the Characters object for the Shape.
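As a rough illustration of the two routes, the sketch below reads a shape's text through Shape.Text and through Characters, then narrows the Characters object to a sub-range.  The class and method names (TextPeek, Describe) are made up for this example, and it assumes vsoShape is a valid shape obtained from a running Visio instance.

```csharp
using System;
using IVisio = Microsoft.Office.Interop.Visio;

public static class TextPeek   // hypothetical helper, not part of the Visio API
{
    public static void Describe(IVisio.Shape vsoShape)
    {
        // Simple route: the whole text as one string
        Console.WriteLine("Shape.Text:      " + vsoShape.Text);

        // Characters route: the same text, but addressable by position
        IVisio.Characters vsoChars = vsoShape.Characters;
        Console.WriteLine("Characters.Text: " + vsoChars.Text);
        Console.WriteLine("CharCount:       " + vsoChars.CharCount);

        // Narrow the object to the first character (or to nothing if the text is empty)
        vsoChars.Begin = 0;
        vsoChars.End = Math.Min(1, vsoChars.CharCount);
        Console.WriteLine("First character: " + vsoChars.TextAsString);
    }
}
```

Begin and End are positions between characters, so Begin = 0 and End = 1 select exactly one character.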

 

Characters is a kind of do-it-all object for manipulating text.  It is capable of breaking text down into individual characters or sets of characters.  The Characters object can describe text in much the same way that the Visio Shapesheet stores shape text.  Consider the following shape with text:

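[Figure: sample shape containing the text “Hello World Sample Text” with mixed character formatting]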

The shape text consists of a single line with a variety of formats applied to the characters.  Visio has to keep track of not just the characters but their formatting as well.  It does this using text runs.  A text run is a set of characters that have something in common.  Typically a text run is a set of characters with identical formatting.  That is how Visio represents text in the Shapesheet.

[Figure: ShapeSheet window for the sample shape, showing the rows that describe its text]

Our sample text is described by six text runs.  The number in the first column defines how many characters are in the run.  Thus “Hello ” is described by the first line, “World” by the second, and the space after “World” by the third.  Then come “Sample ” and “Text”.  Finally, Visio (starting with Visio 2003) terminates the text with a non-visible character, which accounts for the sixth run.  Note that this final character is not reported when using the Shape.Text property, but it is reported using Shape.Characters.Text.  Our sample text varies mostly by the Style property, which defines whether any bold, italic or underline formatting is applied.  The Color property also varies for one text run.
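To make the Style and Color columns concrete, here is a minimal sketch that walks the rows of a shape's Characters section and prints those two cells.  The helper name CharSectionDump is hypothetical, and the numeric flag values in the comments are the ones I understand Visio to use for bold, italic and underline; treat them as assumptions and check them against the VisCellVals constants in your interop assembly.

```csharp
using System;
using IVisio = Microsoft.Office.Interop.Visio;

public static class CharSectionDump   // hypothetical helper name
{
    public static void Dump(IVisio.Shape vsoShape)
    {
        short section = (short)IVisio.VisSectionIndices.visSectionCharacter;
        short rowCount = vsoShape.get_RowCount(section);

        for (short row = 0; row < rowCount; row++)
        {
            IVisio.Cell styleCell = vsoShape.get_CellsSRC(
                section, row, (short)IVisio.VisCellIndices.visCharacterStyle);
            IVisio.Cell colorCell = vsoShape.get_CellsSRC(
                section, row, (short)IVisio.VisCellIndices.visCharacterColor);

            // Style is a bit mask: assumed 1 = bold, 2 = italic, 4 = underline
            int style = (int)styleCell.ResultIU;
            bool bold = (style & 1) != 0;
            bool italic = (style & 2) != 0;
            bool underline = (style & 4) != 0;

            // Color may be an index or an RGB formula, so print the formula text
            Console.WriteLine("Row {0}: bold={1} italic={2} underline={3} color={4}",
                row, bold, italic, underline, colorCell.FormulaU);
        }
    }
}
```

For the sample shape this should print one line per formatting row, in the same order as the rows described above.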

 

There is also a notion of a paragraph run, which is the collection of all characters with the same paragraph formatting properties.  Note that, like a text run, a paragraph run can span carriage returns because it describes the formatting, not the structure.  As before, the number in the first column is the number of characters in the run.  In our example all the characters in the shape text use the same paragraph formatting properties, so Visio puts a zero in the column.  The same would apply to the Characters section if there were no variation in character formatting.

 

The Visio SDK has sample code demonstrating how to set up custom formatting on shape text, but there isn’t an example of reading the text back.  Let’s take a look at a function that can analyze Visio shape text and return the specific text runs that describe the formatting boundaries.

 

using System;
using System.Collections.Generic;
using System.Text;
using IVisio = Microsoft.Office.Interop.Visio;

public class TextUtilities
{
    public static List<string> getTextRunList(IVisio.Shape vsoShape)
    {
        List<string> textRunList = new List<string>();

        // Get the Characters object representing the shape text
        IVisio.Characters vsoChars = vsoShape.Characters;
        int numChars = vsoChars.CharCount;
        int runBegin = 0, runEnd = 1;

        // Find the beginning point and end point of every text run in the shape
        for (int c = 0; c < numChars; c = runEnd)
        {
            // Set the begin and end of the Characters object to the current position
            vsoChars.Begin = c;
            vsoChars.End = c + 1;

            // Get the beginning and end of this character run
            runBegin = vsoChars.get_RunBegin((short)IVisio.VisRunTypes.visCharPropRow);
            runEnd = vsoChars.get_RunEnd((short)IVisio.VisRunTypes.visCharPropRow);

            // Set the begin and end of the Characters object to this run
            vsoChars.Begin = runBegin;
            vsoChars.End = runEnd;

            // Record the text in this run
            textRunList.Add(vsoChars.TextAsString);

            // As the for loop proceeds, c is set to the end of the current run
        }

        return textRunList;
    }
}

 

This C# code uses a generic List<string> (available since .NET 2.0) to store each text run found.  The Characters object is retrieved from the shape, and the total character count is stored.  Then the procedure iterates through the characters from beginning to end.  However, the for loop is unusual because it does not advance one character at a time.  Let’s examine the loop more closely.

 

The purpose of the for loop is to find text runs and add them to the string list.  First a substring within Characters is defined by setting the Begin and End properties of Characters.  This substring is initially just the single character at the current position (Begin = c, End = c + 1).  Then the boundary of the text run containing that character is established by calling RunBegin and RunEnd (exposed in C# as get_RunBegin and get_RunEnd).  These return the positions where the text run containing the substring starts and ends.  A new substring is defined by updating the Begin and End properties to match RunBegin and RunEnd.  This substring is added to the text run list.  Since the procedure has located the end of the current run, the loop sets c to that position before looking for the next text run.
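The stepping logic can be tried without Visio by swapping in a stand-in for the Characters object.  The sketch below is only a model: FakeCharacters and RunWalkDemo are hypothetical names, the stand-in knows nothing but the text and a list of run lengths, it assumes those lengths add up to the length of the text, and it leaves out the invisible terminating character.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for Characters: just the text plus the length of each run
public class FakeCharacters
{
    private readonly string text;
    private readonly int[] runLengths;

    public int Begin;
    public int End;

    public FakeCharacters(string text, params int[] runLengths)
    {
        this.text = text;
        this.runLengths = runLengths;
    }

    public int CharCount { get { return text.Length; } }

    public string TextAsString { get { return text.Substring(Begin, End - Begin); } }

    // Start position of the run that contains the character at Begin
    public int RunBegin()
    {
        int start = 0;
        foreach (int len in runLengths)
        {
            if (Begin < start + len) return start;
            start += len;
        }
        return start;
    }

    // End position (exclusive) of the run that contains the character at Begin
    public int RunEnd()
    {
        int end = 0;
        foreach (int len in runLengths)
        {
            end += len;
            if (Begin < end) return end;
        }
        return end;
    }
}

public static class RunWalkDemo
{
    // Same loop shape as getTextRunList, written against the stand-in
    public static List<string> Walk(FakeCharacters chars)
    {
        List<string> runs = new List<string>();
        int runEnd = 0;
        for (int c = 0; c < chars.CharCount; c = runEnd)
        {
            chars.Begin = c;          // probe the single character at c
            chars.End = c + 1;
            int runBegin = chars.RunBegin();
            runEnd = chars.RunEnd();
            chars.Begin = runBegin;   // widen to the whole run
            chars.End = runEnd;
            runs.Add(chars.TextAsString);
        }
        return runs;
    }
}
```

Calling RunWalkDemo.Walk(new FakeCharacters("Hello World Sample Text", 6, 5, 1, 7, 4)) visits positions 0, 6, 11, 12 and 19, and collects the five visible runs from the sample shape.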

 

Once you have a list of the text run strings, you can match them with the property information retrieved from the Characters section of the Shapesheet.  This gives a more complete description of the text contents.  There are other uses for RunBegin and RunEnd beyond character formatting.  By passing other flags to the methods, you can get the paragraph runs instead or even individual words or text fields.
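One way to try the other flags is to turn the run type into a parameter.  The sketch below is a variation on getTextRunList rather than anything from the SDK; RunLister and GetRuns are hypothetical names, and the extra Math.Max guard is a defensive assumption in case a run type reports an end position that does not move past the current character.

```csharp
using System;
using System.Collections.Generic;
using IVisio = Microsoft.Office.Interop.Visio;

public static class RunLister   // hypothetical helper name
{
    public static List<string> GetRuns(IVisio.Shape vsoShape, IVisio.VisRunTypes runType)
    {
        List<string> runs = new List<string>();
        IVisio.Characters vsoChars = vsoShape.Characters;
        int numChars = vsoChars.CharCount;
        short type = (short)runType;

        int c = 0;
        while (c < numChars)
        {
            // Probe the single character at the current position
            vsoChars.Begin = c;
            vsoChars.End = c + 1;

            // Find the boundaries of the run of the requested type
            int runBegin = vsoChars.get_RunBegin(type);
            int runEnd = vsoChars.get_RunEnd(type);

            // Widen to the whole run and record its text
            vsoChars.Begin = runBegin;
            vsoChars.End = runEnd;
            runs.Add(vsoChars.TextAsString);

            // Always move forward, even if runEnd is not past c
            c = Math.Max(runEnd, c + 1);
        }
        return runs;
    }
}
```

For example, GetRuns(vsoShape, IVisio.VisRunTypes.visParaPropRow) should list the paragraph formatting runs, while visWordRun and visFieldRun ask for words and text fields.  I have not checked how every run type treats text that lies outside a word or field, so inspect the output before relying on it.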