Eric Cressey

Tech writer - Content strategist - Developer

Programmatically applying regular expressions


As content owners, we’re sometimes asked to take on big projects to maintain that content. For example:

  • Last year at Symantec, we stopped using the verisign.com domain. We needed to update all VeriSign URL references to Symantec URLs in our products and emails. The project scope was all the URLs in more than 20,000 files.

  • At a previous company, we used several old, large Flare projects. Some of the files had nonstandard HTML that caused issues with our CSS files. We needed to remove this legacy content from the 10,000-page project.

Big projects like these go beyond what you can do with a few regular expressions. The expanded scope requires more planning and more regular expressions applied to more locations.

To tackle these projects, I wrote a program to apply regular expressions to the files in a directory. This program let me focus on writing regular expressions and made it easy to test whenever I wanted to measure my progress. You can get the C# program here.

Getting started with the programmatic approach

Let’s take a look at how to customize the program for your own purposes.

The program has a Main function and two helper methods:

  • Main function
  • ProcessDirectory
  • ProcessFile

Updating the directory location in the Main function

The Main function contains a variable for the directory you want to update.

    string directory = @"C:\Users\username\Desktop\SampleDocs";
    //For option 2, the program will use the directory where the .exe file is.
    //Use the following line instead of the one above (line 25 in the source file).
    //string directory = Path.GetDirectoryName(System.Reflection.Assembly.GetExecutingAssembly().Location);

By default, the directory variable holds a hard-coded folder path. If you want to point the program at a specific folder, change that path and click the Start button on the Visual Studio toolbar; there’s no need to build the application and copy the executable anywhere.

If you want to run the program as an executable in the directory of your choice, you can do that by making a few changes:

    //string directory = @"C:\Users\username\Desktop\SampleDocs";
    //For option 2, the program will use the directory where the .exe file is.
    //Use the following line instead of the one above (line 25 in the source file).
    string directory = Path.GetDirectoryName(System.Reflection.Assembly.GetExecutingAssembly().Location);

Now when you’re ready to test, build the application (Ctrl+Shift+B) and then grab the RegexForWriters.exe file from the Visual Studio project’s RegexForWriters\RegexForWriters\bin\Debug folder.

Copy the .exe file into the directory you want to update and run it from there. The program processes the folder that contains the executable.
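If switching between the two options by editing the source gets tedious, one alternative is to accept the folder as a command-line argument and fall back to the executable's folder when none is given. This is a minimal sketch, not part of the original program: the ResolveDirectory helper and its class are hypothetical names, and it assumes .NET Framework 4.6 or later, where AppContext.BaseDirectory is available.

```csharp
using System;
using System.IO;

public static class DirectoryChooser
{
    // Hypothetical helper, not part of the original program.
    // Returns the first command-line argument if it names an existing folder;
    // otherwise falls back to the folder that contains the running executable.
    public static string ResolveDirectory(string[] args)
    {
        if (args != null && args.Length > 0 && Directory.Exists(args[0]))
        {
            return Path.GetFullPath(args[0]);
        }
        return AppContext.BaseDirectory;
    }
}
```

To use it, Main would need the `string[] args` parameter, and the directory line would become `string directory = DirectoryChooser.ResolveDirectory(args);`.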

Setting allowed file extensions with ProcessDirectory()

If you only want to process specific file types, you can make a few changes to the ProcessDirectory method. Here’s the code:

    //get all the files in the folder and process them
    string[] fileEntries = Directory.GetFiles(directory);

    List<string> file_list = new List<string>(fileEntries);
    try
    {
        Parallel.ForEach(file_list, file =>
        {
            //you can apply edits to specific file types. 
            //If you don't want to specify file types, delete the if line and its closing brace
            //(lines 59 and 61 in the source file).
            if (Path.GetExtension(file) == ".txt" || (Path.GetExtension(file) == ".html")) {
                ProcessFile(file);
            }
        });
    }

If you want to change the file types to edit, change this line:

    if (Path.GetExtension(file) == ".txt" || (Path.GetExtension(file) == ".html")) {

As written, only .txt and .html files are processed. To process .properties, .xml, and .htm files instead:

    if (Path.GetExtension(file) == ".properties" || (Path.GetExtension(file) == ".xml") || (Path.GetExtension(file) == ".htm")) {
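As the list of extensions grows, the chained comparisons get hard to read. They are also case-sensitive, so a file named index.HTML would be skipped. One possible alternative, sketched here with hypothetical names (ExtensionFilter, ShouldProcess) that are not in the original program, keeps the extensions in a case-insensitive set:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ExtensionFilter
{
    // Extensions to process. The comparer treats ".XML" and ".xml" as the same entry.
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".properties", ".xml", ".htm" };

    // Returns true when the file's extension is in the allowed set.
    public static bool ShouldProcess(string file)
    {
        return Allowed.Contains(Path.GetExtension(file));
    }
}
```

Inside the Parallel.ForEach loop, the condition would then read `if (ExtensionFilter.ShouldProcess(file))`, and adding a file type means adding one string to the set.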

Adding regular expressions to ProcessFile()

The ProcessFile method is the place you’ll add regular expressions and tell the program how to update the text. Here’s the entire method:

    public static void ProcessFile(string file)
    {
        /** This method:
         * 1. Gets all the text in a file
         * 2. Performs a series of regular expressions
         * 3. Saves the file with the updated text
         * */

        //get all text
        string text = File.ReadAllText(file);

        //use regex.replace function and pass in the text to search, regex, and replacement string
        //[^\r\n]* stops at the end of the line, so Windows line endings (\r\n) stay intact
        text = Regex.Replace(text, @"(?<=Greeting=)[^\r\n]*", "Hello, world!");

        //you can remove text by passing in a blank replacement string
        //use regex options at the end if you want to ignore case in your match
        //if the regex has quotes in it, escape them as shown here
        text = Regex.Replace(text, @"<p.*?class="".*?unnecessary.*?"".*?>.*?<\/p>", "", RegexOptions.IgnoreCase);

        //if you want to do a simple text replacement without regex, use the string.replace method
        text = text.Replace(@"Old text", "New text");

        /**
         *  Add more regular expressions here, as many as you like. 
         * */

        //Finally, save the file with the updated text
        File.WriteAllText(file, text);
    }

Use the Regex.Replace() method to apply regular expressions to the text. The method takes three arguments, plus an optional fourth for regex options: Regex.Replace(text to update, regular expression, replacement text, regex options)

The text to update is the file text, stored in the text variable. After that, specify your regular expression. If your regular expression has quotes in it, write it as a verbatim string: put an @ before the opening quote and double each quote inside the pattern, as shown here: @"<p.*?class="".*?unnecessary.*?"".*?>.*?<\/p>". A verbatim string also treats backslashes literally, so regex escapes like \d don’t need a second backslash. Always put the regex and replacement text in quotes because they’re strings.
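One caveat worth testing against your own content: the lazy .*? pieces in the paragraph pattern can start matching at an earlier <p> tag on the same line and swallow a paragraph you meant to keep. The sketch below compares the pattern from ProcessFile with a tighter variant. The variant and the ParagraphCleaner class are my own illustration, not part of the original program, and it is still a regex rather than an HTML parser, so treat it as a starting point.

```csharp
using System.Text.RegularExpressions;

public static class ParagraphCleaner
{
    // Pattern from ProcessFile. In a verbatim string, "" stands for one quote character.
    public const string Original = @"<p.*?class="".*?unnecessary.*?"".*?>.*?<\/p>";

    // Hypothetical tighter variant: [^>]* cannot leave the opening tag,
    // and [^""]* cannot leave the class attribute value.
    public const string Tighter = @"<p\b[^>]*\bclass=""[^""]*unnecessary[^""]*""[^>]*>.*?</p>";

    // Removes every match of the given pattern, ignoring case.
    public static string Remove(string text, string pattern)
    {
        return Regex.Replace(text, pattern, "", RegexOptions.IgnoreCase);
    }
}
```

For the input `<p>keep</p><p class="unnecessary">drop</p>`, the original pattern removes both paragraphs, while the tighter variant leaves `<p>keep</p>`.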

Using group text in replacements

One of the cool things about regex groups is that you can reference group values in your replacement text. This can save a lot of time. In .NET, you reference numbered groups as $1, $2, and so on, and named groups as ${name}. Read this post to learn more about regex groups.

    //you can reference capturing groups in your replacement string by number, $1, $2, etc.
    text = Regex.Replace(text, @"<span.*?class=""bold"".*?>(.*?)<\/span>", "<strong>$1</strong>");
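When a pattern has several groups, numbered references get hard to follow. .NET also supports named groups, written (?<name>...) in the pattern and ${name} in the replacement. A minimal sketch of the same span-to-strong replacement, where the class, method, and group names are illustrative assumptions:

```csharp
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Converts <span class="bold">...</span> to <strong>...</strong>.
    public static string BoldSpansToStrong(string text)
    {
        // (?<content>...) names the capturing group; ${content} inserts its value.
        return Regex.Replace(
            text,
            @"<span.*?class=""bold"".*?>(?<content>.*?)<\/span>",
            "<strong>${content}</strong>");
    }
}
```

The result is the same as with $1. The benefit is that the replacement string still reads correctly if you later add another group earlier in the pattern.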

More uses for this program

Because this program quickly updates files in a directory, it is broadly applicable to file-related grunt work. You might use it to:

  • rename files in bulk
  • customize a web help output by inserting HTML and CSS and JavaScript references
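As a rough idea of what the bulk rename case could look like, here is a sketch that applies a regular expression to each file name instead of the file contents. BulkRenamer and its methods are hypothetical and not part of the original program. It only looks at the top level of the folder and skips any rename that would overwrite an existing file.

```csharp
using System.IO;
using System.Text.RegularExpressions;

public static class BulkRenamer
{
    // Computes the new name only, so you can check the result before touching the disk.
    public static string NewName(string fileName, string pattern, string replacement)
    {
        return Regex.Replace(fileName, pattern, replacement);
    }

    // Renames matching files in the directory and returns how many were renamed.
    public static int RenameFiles(string directory, string pattern, string replacement)
    {
        int renamed = 0;
        foreach (string path in Directory.GetFiles(directory))
        {
            string oldName = Path.GetFileName(path);
            string newName = NewName(oldName, pattern, replacement);
            string target = Path.Combine(directory, newName);

            // Skip unchanged names and never overwrite an existing file.
            if (newName != oldName && !File.Exists(target))
            {
                File.Move(path, target);
                renamed++;
            }
        }
        return renamed;
    }
}
```

For example, `BulkRenamer.RenameFiles(directory, "^verisign_", "symantec_")` would rename verisign_a.txt to symantec_a.txt. Run it on a copy of the folder first.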

Recently, we migrated some emails from XML to properties and I used this program to fetch content from XML and create the appropriate properties files. If you’re interested in using the programmatic approach to text editing tasks, this program is a great place to start.
