Python Pandas Tree Structure

8/5/2019

I have an XML document that contains a hierarchical, tree-like structure; see the example below.

The document contains several <Message> tags (I only copied one of them for convenience).

Each <Message> has some associated data (id, status, priority) on its own.

Besides, each <Message> can contain one or more <Street> children which again have some relevant data (<name>, <length>).

Moreover, each <Street> can have one or more <Link> children which again have their own relevant data (<id>, <direction>).

Example XML document:

Parsing the XML with Python and storing the relevant data in variables is not the problem: I can use, for example, the lxml library and either read the whole document and then run XPath expressions to get the relevant fields, or read it incrementally with the iterparse method.

However, I would like to put the data into a pandas dataframe while preserving the hierarchy in it. The goal is to query for single messages (e.g. by Boolean expressions such as "if status is Active, get the Message with all its streets and its streets' links") and get all the data that belongs to the specific message. How would this best be done?

I tried different approaches but ran into problems with all of them.

If I create one dataframe row for each XML row that contains information and then set a MultiIndex on [MessageID, StreetName, LinkID], I get an index with lots of NaN in it (which is generally discouraged) because a MessageID row does not know its child streets and links yet. Besides, I would not know how to select a sub-dataset by Boolean condition instead of only getting single rows without their children.

When doing a GroupBy on [MessageID, StreetName, LinkID], I do not know how to get back a (probably MultiIndex) dataframe from the pandas GroupBy object since there is nothing to aggregate here (no mean/std/sum/whatsoever, the values should stay the same).

Any suggestions how this could be handled efficiently?

Dirk

1 Answer

I finally managed to solve the problem as described above and this is how.

I extended the above given XML document to include two messages instead of one. This is how it looks as a valid Python string (it could also be loaded from a file of course):
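The original XML string did not survive on this page. As a stand-in, here is a hypothetical document with two messages. The tag names (`Message`, `Street`, `Link`, `id`, `status`, `priority`, `name`, `length`, `direction`) come from the question, while the `Messages` root element, the nesting details and all values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: tag names follow the question, the root element
# <Messages> and every value are invented for illustration.
xml_string = """<Messages>
  <Message>
    <id>1</id>
    <status>Active</status>
    <priority>High</priority>
    <Street>
      <name>Main Street</name>
      <length>1200</length>
      <Link>
        <id>101</id>
        <direction>North</direction>
      </Link>
      <Link>
        <id>102</id>
        <direction>South</direction>
      </Link>
    </Street>
  </Message>
  <Message>
    <id>2</id>
    <status>Inactive</status>
    <priority>Low</priority>
    <Street>
      <name>Oak Avenue</name>
      <length>800</length>
      <Link>
        <id>201</id>
        <direction>East</direction>
      </Link>
    </Street>
  </Message>
</Messages>"""

# Quick sanity check that the string is well-formed XML.
root = ET.fromstring(xml_string)
messages = root.findall("Message")
print(len(messages), "messages found")
```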

To parse the hierarchical XML structure into a flat pandas dataframe, I used the iterparse function of Python's ElementTree module, which provides a SAX-like interface: it iterates through an XML document element by element and fires events when specific XML tags start or end.

For each parsed XML element, the given information is stored in a dictionary. Three dictionaries are used, one for each set of data that belongs together (message, street, link) and that is to be stored in its own dataframe row later on. When all information for one such row is collected, the dictionary is appended to a list storing all rows in their appropriate order.

This is what the XML parsing looks like (see inline comments for further explanation):
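The original parsing code is missing from this page. The following is a minimal sketch of such a loop, not the answer's actual code. It assumes a hypothetical document layout in which the scalar fields of a `Message` or `Street` appear before its nested children, and it deviates slightly from the description above: each row dictionary is appended when its element opens and is filled in as its child tags close, which keeps the rows in document order. The helper name `parse_rows` and the sample data are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the tag names come from the question.
XML = b"""<Messages>
  <Message>
    <id>1</id><status>Active</status><priority>High</priority>
    <Street>
      <name>Main Street</name><length>1200</length>
      <Link><id>101</id><direction>North</direction></Link>
      <Link><id>102</id><direction>South</direction></Link>
    </Street>
  </Message>
  <Message>
    <id>2</id><status>Inactive</status><priority>Low</priority>
    <Street>
      <name>Oak Avenue</name><length>800</length>
      <Link><id>201</id><direction>East</direction></Link>
    </Street>
  </Message>
</Messages>"""


def parse_rows(xml_bytes):
    """Flatten Message/Street/Link elements into a list of row dictionaries."""
    rows = []
    message, street, link = {}, {}, {}
    path = []  # tags of the elements that are currently open
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        tag = elem.tag
        if event == "start":
            path.append(tag)
            # Open a new row and inherit the keys of the enclosing rows.
            if tag == "Message":
                message = {}
                rows.append(message)
            elif tag == "Street":
                street = {"messageId": message.get("messageId")}
                rows.append(street)
            elif tag == "Link":
                link = {"messageId": message.get("messageId"),
                        "streetName": street.get("streetName")}
                rows.append(link)
            continue
        # "end" event: the element text is complete now.
        path.pop()
        parent = path[-1] if path else None
        text = (elem.text or "").strip()
        # <id> occurs in both Message and Link, so the parent tag decides.
        if parent == "Message" and tag in ("id", "status", "priority"):
            message["messageId" if tag == "id" else tag] = text
        elif parent == "Street" and tag in ("name", "length"):
            street["streetName" if tag == "name" else tag] = text
        elif parent == "Link" and tag in ("id", "direction"):
            link["linkId" if tag == "id" else tag] = text
    return rows


listOfRows = parse_rows(XML)
for row in listOfRows:
    print(row)
```

Tracking the currently open tags in `path` is one way to tell a message's `<id>` from a link's `<id>`; other bookkeeping schemes would work just as well.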

listOfRows is now a list of dictionaries where each dictionary stores the information that is to be put into one dataframe row. Creating a dataframe with this list as datasource can be done with

and gives the 'raw' dataframe:

We can now set the columns of interest (messageId, streetName, linkId) as a MultiIndex on that dataframe:

which gives:
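The code and output for these two steps are missing from this page. As a rough illustration, building the raw dataframe and setting the MultiIndex might look like the sketch below. The rows in `listOfRows` are hypothetical stand-ins for what the parser would produce; the column names `status`, `priority`, `length` and `direction` are assumptions derived from the tag names in the question.

```python
import pandas as pd

# Hypothetical rows as a parser might produce them (values are invented).
listOfRows = [
    {"messageId": "1", "status": "Active", "priority": "High"},
    {"messageId": "1", "streetName": "Main Street", "length": "1200"},
    {"messageId": "1", "streetName": "Main Street", "linkId": "101",
     "direction": "North"},
    {"messageId": "2", "status": "Inactive", "priority": "Low"},
    {"messageId": "2", "streetName": "Oak Avenue", "length": "800"},
]

# Keys missing from a dictionary become NaN in the corresponding column.
df = pd.DataFrame(listOfRows)

# Move the three hierarchy columns into a MultiIndex.
indexed = df.set_index(["messageId", "streetName", "linkId"])
print(indexed)
```

Message rows end up with NaN in the `streetName` and `linkId` index levels, and street rows with NaN in `linkId`, which is the situation discussed next.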

Even though having NaN in an index is generally discouraged, I don't see any problem with it for this use case.

Finally, to get the desired effect of accessing single messages by their messageId, including all of their 'children' streets and links, the MultiIndexed dataframe has to be grouped by the outermost index level:
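A minimal sketch of this grouping step, using hypothetical rows in place of the parsed data (the original code is not preserved here). `groupby(level=...)` groups on an index level, and iterating over the result yields one sub-dataframe per message without any aggregation.

```python
import pandas as pd

# Hypothetical rows (values are invented for illustration).
rows = [
    {"messageId": "1", "status": "Active", "priority": "High"},
    {"messageId": "1", "streetName": "Main Street", "length": "1200"},
    {"messageId": "1", "streetName": "Main Street", "linkId": "101",
     "direction": "North"},
    {"messageId": "2", "status": "Inactive", "priority": "Low"},
    {"messageId": "2", "streetName": "Oak Avenue", "length": "800"},
]
indexed = pd.DataFrame(rows).set_index(["messageId", "streetName", "linkId"])

# Group by the outermost index level; nothing is aggregated.
grouped = indexed.groupby(level="messageId")

# Each group is a dataframe holding the message row plus its streets and links.
sizes = {}
for message_id, group in grouped:
    sizes[message_id] = len(group)
    print(message_id, "->", len(group), "rows")
```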

Now, you can for example loop over all messages (and do whatever with them) with

which returns

or you can access specific messages by the messageId, returning the row containing the messageId and also all of its dedicated streets and links:

gives
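The snippets and outputs for this part are missing. The sketch below shows one way to do it under the same assumptions as before (hypothetical rows, invented values): `get_group` fetches one message with all of its rows, and a Boolean condition on `status` can be turned into a set of message ids that then selects every row belonging to those messages. The second part is an addition addressing the question's Boolean-selection requirement, not something the answer text describes.

```python
import pandas as pd

# Hypothetical rows (values are invented for illustration).
rows = [
    {"messageId": "1", "status": "Active", "priority": "High"},
    {"messageId": "1", "streetName": "Main Street", "length": "1200"},
    {"messageId": "1", "streetName": "Main Street", "linkId": "101",
     "direction": "North"},
    {"messageId": "2", "status": "Inactive", "priority": "Low"},
    {"messageId": "2", "streetName": "Oak Avenue", "length": "800"},
]
indexed = pd.DataFrame(rows).set_index(["messageId", "streetName", "linkId"])
grouped = indexed.groupby(level="messageId")

# One message with all of its street and link rows.
message_1 = grouped.get_group("1")
print(message_1)

# Boolean selection: find the ids of active messages first ...
message_ids = indexed.index.get_level_values("messageId")
active_ids = message_ids[(indexed["status"] == "Active").to_numpy()]

# ... then keep every row (message, street, link) that carries one of them.
active = indexed[message_ids.isin(active_ids)]
print(active)
```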

Hope this will be helpful for somebody sometime.

Dirk


pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

Here are just a few of the things that pandas does well:

Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific research environments. For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.

I have a pandas DataFrame of unique rows which looks something like this:

The columns of df are ordered in a linear parent-child relation, where column O is level 1, column B is level 2, and so on. The intention is to convert this df into a tree-like structure for navigation purposes, which would look something like this:

Filtering df on every value of each column (as parent) and then copying all unique values of the remaining columns to the right as children seems like a bad way to achieve this.

Is there an efficient way?

Mohi

1 Answer

As I mentioned in the question, there is this way to achieve it:

Filtering df on every value of each column (as parent) and then copying all unique values of the remaining columns to the right as children seems like a bad way to achieve this.

And the solution with the same logic is here:
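The answer's code is not preserved on this page. For comparison, here is a sketch of a different technique, not the filter-and-copy logic described above: each row is walked once from left to right and inserted into a nested dictionary. The column names, the values and the helper `frame_to_tree` are hypothetical, and a nested dict is only one possible tree representation.

```python
import pandas as pd

# Hypothetical frame of unique rows; columns are ordered parent -> child.
df = pd.DataFrame({
    "level1": ["A", "A", "A", "B"],
    "level2": ["a1", "a1", "a2", "b1"],
    "level3": ["x", "y", "z", "w"],
})


def frame_to_tree(frame):
    """Build a nested dict in which every value maps to its children."""
    tree = {}
    for row in frame.itertuples(index=False):
        node = tree
        for value in row:
            # Reuse the child node if it already exists, else create it.
            node = node.setdefault(value, {})
    return tree


tree = frame_to_tree(df)
print(tree)
```

This touches every cell exactly once, so the cost grows with the number of rows times the number of columns rather than with repeated filtering of the whole frame.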

Mohi
