How to Extract and Parse Geocoding Data Using OpenStreetMap and Python

Introduction

OpenStreetMap (OSM) is a project to create a free, editable map of the world. It differs from Google Maps in two main ways:

  • Data sources: OSM gathers local data contributed by people, while Google relies on different data sources.
  • Data granularity: OSM breaks its data into many sub-divisions, unlike Google Maps.

Objective

Extract data from OpenStreetMap and convert it to CSV format using Python.

Extraction of data

Data is collected from various sources and loaded into a data warehouse using Pentaho Data Integration. The Pentaho Data Integration tool cleanses and transforms the data, applies rules, and stores the result in the data warehouse.

The required data is available on the official OpenStreetMap website. It is downloaded as an .osm file, which is XML-formatted.

Converting XML to CSV

Converting XML to CSV looks simple at first, but it is difficult here because the OSM data is irregular: elements do not all carry the same fields, so the file does not map cleanly onto a fixed set of CSV columns.

Conventional Methods of Conversion

The methods below usually work for regularly structured XML files, but because of the inconsistency in the OSM data they do not produce the required output.

Using R:-

  • The ‘rio’ package in R (an add-on package, not part of base R) supports a number of file conversions, including XML to CSV. This doesn’t help in our case.
  • Parsing the XML file and storing it in a list also doesn’t work.

Using Python:-

Python’s built-in libraries do not support direct conversion because of the structure of the data.

Methods that can be used (Python)

The ElementTree package in Python (the standard-library module xml.etree.ElementTree) helps us parse, modify and query XML files. Knowing the structure of the XML file and its components helps with the conversion. The components include elements, tags, attributes and values.
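To make these components concrete, here is a minimal sketch that inspects a tiny, made-up fragment shaped like an .osm file. The sample data, the name SAMPLE_OSM and the helper describe() are illustrative assumptions and do not come from a real OSM extract.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up fragment in the shape of an .osm file (illustrative data only).
SAMPLE_OSM = """<osm version="0.6">
  <node id="1" lat="12.9716" lon="77.5946">
    <tag k="name" v="Sample Cafe"/>
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2" lat="12.9720" lon="77.5950"/>
</osm>"""

# fromstring() parses XML held in a string and returns the root element.
root = ET.fromstring(SAMPLE_OSM)


def describe(element):
    """Hypothetical helper: return the element's tag name and its attributes."""
    return element.tag, dict(element.attrib)


first_node = root.find("node")
node_tag, node_attrs = describe(first_node)

# Each <tag> child stores one key/value pair in its "k" and "v" attributes.
child_tags = {t.get("k"): t.get("v") for t in first_node.findall("tag")}

print(node_tag, node_attrs)
print(child_tags)
```

Note that the second node in the sample has no tag children at all. This kind of variation from one element to the next is what makes a direct, one-step conversion to CSV awkward.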

The link given below explains with an example how to parse and query an XML file.

https://www.datacamp.com/community/tutorials/python-xml-elementtree

Some functions that can be used in the Python code:

ET.parse() – for parsing the XML file
getroot() – navigating to the root of the file
root.iter() – iterating through the file starting with the root
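The three functions above can be combined into a rough conversion sketch. This is one possible approach under stated assumptions, not a definitive implementation: it only handles node elements, the sample data is made up, and the names SAMPLE_OSM, osm_nodes_to_rows and rows_to_csv are hypothetical. An in-memory file object stands in for a downloaded .osm file so that the example runs on its own.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample standing in for a downloaded .osm file (illustrative data only).
SAMPLE_OSM = """<osm version="0.6">
  <node id="1" lat="12.9716" lon="77.5946">
    <tag k="name" v="Sample Cafe"/>
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2" lat="12.9720" lon="77.5950"/>
</osm>"""

FIXED_COLUMNS = ["id", "lat", "lon"]


def osm_nodes_to_rows(root):
    """Hypothetical helper: build one dict per <node>, including its tag pairs."""
    rows = []
    for node in root.iter("node"):
        row = {name: node.get(name) for name in FIXED_COLUMNS}
        for tag in node.findall("tag"):
            # A tag key equal to a fixed column name would overwrite that column.
            row[tag.get("k")] = tag.get("v")
        rows.append(row)
    return rows


def rows_to_csv(rows, out):
    """Hypothetical helper: write rows using the union of all keys as columns."""
    extra = sorted({key for row in rows for key in row} - set(FIXED_COLUMNS))
    writer = csv.DictWriter(out, fieldnames=FIXED_COLUMNS + extra, restval="")
    writer.writeheader()
    writer.writerows(rows)


# ET.parse() accepts a file name or a file object; here a StringIO is used.
tree = ET.parse(io.StringIO(SAMPLE_OSM))
root = tree.getroot()

rows = osm_nodes_to_rows(root)
buffer = io.StringIO()
rows_to_csv(rows, buffer)
csv_lines = buffer.getvalue().splitlines()
print("\n".join(csv_lines))
```

Taking the union of all tag keys as the column set is one way to cope with nodes that carry different tags. Nodes that lack a given key simply get an empty cell in that column. For a real file, the StringIO argument would be replaced by the path to the downloaded .osm file.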

Conclusion

Converting XML to CSV is difficult in the case of OpenStreetMap data because of its structure. The regular methods of conversion fail to serve our purpose, so we can instead try parsing the file and retrieving the data using the ElementTree package in Python.