Create KML files by extracting coordinates from DOCX files

It’s been a while since I last wrote here, and this is not the most exciting topic for a post. I regularly receive coordinates and location names from field staff in different parts of the country, plot those locations on maps, and produce reports. Pretty boring stuff! Most of the time the locations arrive neatly packed in Microsoft Office files, commonly DOCX or XLS. But recently they sent me 10 similar DOCX files listing 128 locations, and I am going to create a KML file for each location using Python.

The DOCX files use several tables to organize different items. Unfortunately, the text in these tables holds the coordinates in no consistent format, alongside the location names and short descriptions. To make the task trickier, the coordinates don’t follow a strict DMS convention, and some of them are in decimal degrees (DD). The N and E are sometimes placed in front of the coordinates and sometimes at the end. So plain Python string operations are not enough here; this calls for regular expressions.
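To make the problem concrete, here is a minimal sketch of a regex that accepts both DMS and DD values with the hemisphere letter either in front or at the end. The pattern, the function names (`to_decimal`, `find_lat_lon`) and the sample strings are my own assumptions, not taken from the field files, and real data will probably need more separators than the ones covered here.

```python
import re

# One coordinate: hemisphere letter in front OR at the end, degrees, and
# optional minutes and seconds. Assumes the usual degree/minute/second marks.
COORD = re.compile(
    r"""
    (?:(?<![A-Za-z])(?P<pre>[NSEW])\s*)?               # letter in front
    (?<![\d.])(?P<deg>\d{1,3}(?:\.\d+)?)\s*[°º]?\s*    # degrees (whole or DD)
    (?:(?P<min>\d{1,2}(?:\.\d+)?)\s*['’′]\s*)?         # optional minutes
    (?:(?P<sec>\d{1,2}(?:\.\d+)?)\s*(?:["”″]|'')\s*)?  # optional seconds
    (?(pre)\s*|(?P<post>[NSEW])(?![A-Za-z]))           # otherwise letter at end
    """,
    re.VERBOSE,
)


def to_decimal(match):
    """Convert one COORD match to (signed decimal degrees, hemisphere)."""
    value = (
        float(match["deg"])
        + float(match["min"] or 0) / 60
        + float(match["sec"] or 0) / 3600
    )
    hemi = match["pre"] or match["post"]
    if hemi in "SW":
        value = -value
    return value, hemi


def find_lat_lon(text):
    """Return (lat, lon) from the first N/S and E/W values in text, or None."""
    lat = lon = None
    for match in COORD.finditer(text):
        value, hemi = to_decimal(match)
        if hemi in "NS" and lat is None:
            lat = value
        elif hemi in "EW" and lon is None:
            lon = value
    if lat is None or lon is None:
        return None
    return lat, lon
```

Calling `find_lat_lon("23.7534N 90.3861E")` returns `(23.7534, 90.3861)`, and a line without any hemisphere-tagged number returns `None`, so it can be used as a filter while scanning the tables.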

I need to open each DOCX (using the docx library) and search every line of text in every row of every table with a regex pattern. When a suitable XY pattern is found, it is used to create a KML file. Each KML file is named after a reference ID (see the pic), and the description is added as well.
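The table walk described above could be sketched like this. The function name `iter_table_lines` is hypothetical; it only relies on the `tables`, `rows`, `cells`, `paragraphs` and `text` attributes that python-docx exposes, so any object with the same shape will work.

```python
def iter_table_lines(document):
    """Yield (table_index, row_index, line) for each non-empty text line.

    `document` is expected to look like a python-docx Document:
    document.tables -> table.rows -> row.cells -> cell.paragraphs -> .text
    """
    for t_index, table in enumerate(document.tables):
        for r_index, row in enumerate(table.rows):
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    # A paragraph may hold manual line breaks, so split it.
                    for line in paragraph.text.splitlines():
                        line = line.strip()
                        if line:
                            yield t_index, r_index, line


# Typical use (assumes python-docx is installed; the file name is made up):
#   import docx
#   document = docx.Document("field_report.docx")
#   for t_index, r_index, line in iter_table_lines(document):
#       print(t_index, r_index, line)
```

Keeping the row index with each line makes it possible to tie a coordinate back to the reference ID and description found in the same row.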

This is the starting point. I’ll give each DOCX file its own folder so that the KML files stay organized.
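A small sketch of the folder-per-document idea, assuming the folder is simply named after the DOCX file. The helper name `output_folder_for` and the default root `kml_output` are placeholders of mine.

```python
from pathlib import Path


def output_folder_for(docx_path, root="kml_output"):
    """Create (if needed) and return a folder named after the DOCX file."""
    folder = Path(root) / Path(docx_path).stem  # "north.docx" -> "north"
    folder.mkdir(parents=True, exist_ok=True)   # safe to call repeatedly
    return folder
```

With `exist_ok=True` the script can be rerun on the same files without failing on folders created by an earlier run.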

Now it’s time for the actual code.
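What follows is only a minimal sketch of the last step, writing one KML file per location by hand with the standard library; it is not the full script. The template, the function name `write_kml` and the file-name cleanup rule are assumptions for illustration. Note that KML expects longitude first, then latitude.

```python
import re
from pathlib import Path
from xml.sax.saxutils import escape

KML_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>{name}</name>
    <description>{description}</description>
    <Point>
      <coordinates>{lon},{lat}</coordinates>
    </Point>
  </Placemark>
</kml>
"""


def write_kml(folder, ref_id, lat, lon, description=""):
    """Write a single-placemark KML file named after the reference ID."""
    text = KML_TEMPLATE.format(
        name=escape(ref_id),              # escape &, < and > for XML
        description=escape(description),
        lon=lon,
        lat=lat,
    )
    # Replace characters that are awkward in file names (assumed rule).
    safe_name = re.sub(r"[^\w.-]+", "_", ref_id)
    path = Path(folder) / f"{safe_name}.kml"
    path.write_text(text, encoding="utf-8")
    return path
```

For example, `write_kml("kml_output/north", "REF-01", 23.7534, 90.3861, "Near the bridge")` would produce `kml_output/north/REF-01.kml`, provided the folder already exists.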

I know I could have used a different library for creating KMLs, but I liked it this way. You can even use this to convert the coordinates and create something else.

2 Comments

  1. Hello;

    Thanks for writing about this. I am a beginner with Python, and before I even attempt this I would like to ask whether this approach would work if the coordinate information were written in the document in a different fashion. Please see the example below (it is written in Portuguese, but you should get the idea). The coordinates run over 6 pages; this is a site with an area of 1,960 hectares, a perimeter 24,000 meters long, and 189 vertices. The information I want to capture is:
    Vertex name
    Northing
    Easting
    Azimuth and
    Distance

    “AO NORTE:Inicia-se a descrição deste perímetro no vértice D58-M1815, de coordenadas N 9.420.222,360m. e E 607.789,280m., deste, segue com azimute de 77°01’00” e distância de 53,06 m., até o vértice D58-M1813, de coordenadas N 9.420.234,280m. e E 607.840,980m….”

    Thanks
    Miguel

    • Hi Miguel, it doesn’t matter how the coordinates are written as long as you know regex. Start by capturing only the items you need. Go to regex101.com, paste this string into the ‘test string’ box and start testing. For example, I could capture the northings (the N values) by using N.\d+.\d+.\d+.\d+ from your text. This is an ugly one, but it is at least a start. Use one variable for each item, loop over each group of similar items and collect them in an array, then use the final array to write the document.
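As a rough sketch of that advice applied to the sample sentence: the two patterns below are assumptions built only from the one quoted paragraph, so they may need loosening for the other five pages. The names `VERTEX`, `LEG`, `parse_vertices` and `parse_legs` are made up for this example.

```python
import re

TEXT = (
    "AO NORTE:Inicia-se a descrição deste perímetro no vértice D58-M1815, "
    "de coordenadas N 9.420.222,360m. e E 607.789,280m., deste, segue com "
    "azimute de 77°01’00” e distância de 53,06 m., até o vértice D58-M1813, "
    "de coordenadas N 9.420.234,280m. e E 607.840,980m…."
)

# Vertex name followed by its northing and easting.
VERTEX = re.compile(
    r"vértice\s+(?P<name>[\w-]+),\s+de\s+coordenadas\s+"
    r"N\s+(?P<north>[\d.]+,\d+)\s*m\.\s+e\s+"
    r"E\s+(?P<east>[\d.]+,\d+)\s*m"
)

# Azimuth and distance of the leg leaving a vertex.
LEG = re.compile(
    r"azimute\s+de\s+(?P<az>\d+°\d+\D\d+\D)\s+e\s+"
    r"distância\s+de\s+(?P<dist>[\d.]+,\d+)\s*m"
)


def to_number(text):
    """Turn '9.420.222,360' (dot thousands, comma decimals) into a float."""
    return float(text.replace(".", "").replace(",", "."))


def parse_vertices(text):
    """Return a list of (name, northing, easting) tuples."""
    return [
        (m["name"], to_number(m["north"]), to_number(m["east"]))
        for m in VERTEX.finditer(text)
    ]


def parse_legs(text):
    """Return a list of (azimuth string, distance in meters) tuples."""
    return [(m["az"], to_number(m["dist"])) for m in LEG.finditer(text)]
```

Each list can then be written out row by row, for example with the `csv` module, or turned into points for a KML file.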