
Contributors: Sindhya Babu, Kiara Meço, and Sahil Chande

Web Scraping of Water Gauge Stations from the Emscher Genossenschaft Lippe Verband Website

Introduction

The Emscher Genossenschaft Lippe Verband provides open raw data on water level and discharge for the Emscher and Lippe areas. The data is updated approximately every 15 minutes and is published in two versions: an intranet version for registered users and a public version. In our project, we used the open public data.

Project aim

The aim of the project is to continuously scrape time-varying water level and discharge data from the Übersichtskarte Pegelstände Emscher Lippe website (https://howis.eglv.de/pegel/html/uebersicht_internet.php) using the Python libraries BeautifulSoup and GeoPandas, and to save the data to a PostgreSQL database. We then georeferenced five Pegel (gauge) stations in QGIS and plotted the stations' precise locations on a map.

Tools and packages used

  • Python
    • For web scraping: BeautifulSoup, pandas, NumPy, and requests.
    • For the creation of the geo data frame: GeoPandas, pyproj, shapely.geometry.
    • For the database connection to PostgreSQL: SQLAlchemy, psycopg2.
  • PostgreSQL
    • Database to store data and geometry.
  • pgAdmin 4 and the PostGIS extension for PostgreSQL
    • pgAdmin 4 provides a UI for easier operations with PostgreSQL; PostGIS adds spatial data support.
  • QGIS
    • Application used for plotting graphs and maps and for georeferencing the stations to their precise locations.

One-Time Scraping of the Master Data of the Gauges ("Pegelstammdaten")

The base data (Stammdaten in German) contains information such as the station number (Pegelnummer), the water body, i.e. whether the station is on the Lippe or the Emscher (Gewässer), the river kilometre (Flusskilometer) in km, the gauge zero point (Pegelnullpunkt) above sea level in m NN, the total catchment area (Einzugsgebiet) in km², and the Easting (Rechtswert) and Northing (Hochwert) Gauss-Krüger coordinates, together with the mean high water level (MHW), mean low water level (MNW), and mean water level (MW) in cm for the period from 2001 to 2010. In addition, an image of the Pegel station and a map showing its location are displayed. First, we scrape the text displayed for each Pegel station as well as the corresponding map and store them locally. The image below shows an example of the master data for the station KA Hamm.

Figure 1: Pegel Master data of station KA Hamm

https://howis.eglv.de/pegel/html/stammdaten_html/MO_StammdatenPegel.php?PIDVal=32

To determine the above-mentioned values for all stations, we scrape the website using Python and the BeautifulSoup package. We loop over 200 PIDVal values to get the master data of all available stations. To achieve this, the text stored under the HTML tags needs to be identified by inspecting the web page. Consider the example of the station KA Hamm, where it can be seen that the master data text is stored under the tags <div id="datacontainer"> and <tr class="normtext">. The name of the station is contained in <div id="popupcontenttitle">, while the map image is stored under <div id="mapcontainer"> in an <img src=…> tag. This is shown in Figure 2.

Figure 2: Inspecting source code to determine html tags to be extracted.

The data extracted for one station is shown below. The data frame contains two columns, 'Station' and 'Station Values'. The 'Station Values' column is then split into several columns, which are renamed and stored as a new data frame.

Figure 3: Python code showing extracting text of station name and values for KA Hamm
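
Since Figure 3 shows the original code only as a screenshot, the following is a minimal sketch of the extraction logic, assuming the tags identified in Figure 2; the post-processing in the original code may differ in detail.

  import requests
  import pandas as pd
  from bs4 import BeautifulSoup

  base = "https://howis.eglv.de/pegel/html/stammdaten_html/MO_StammdatenPegel.php?PIDVal="

  records = []
  for pid_val in range(1, 201):
      soup = BeautifulSoup(requests.get(base + str(pid_val)).text, "html.parser")
      title = soup.find("div", id="popupcontenttitle")   # station name
      container = soup.find("div", id="datacontainer")   # master data table
      if title is None or container is None:
          continue  # PIDVal with no station behind it
      # one text entry per row of the master data table
      values = [tr.get_text(" ", strip=True)
                for tr in container.find_all("tr", class_="normtext")]
      records.append({"Station": title.get_text(strip=True),
                      "Station Values": values})

  stations_df = pd.DataFrame(records)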

After looping over all PIDVal values, we found that several contained no data. We dropped these rows and stored a new data frame with only the non-null values. The new data frame contains 131 stations, of which only 103 have geo-coordinate data available, as shown in Figure 4.

Figure 4: Data frame showing the data types and a number of non-null column values.

The geo-coordinate values Rechtswert and Hochwert stored in the above data frame are still of data type float64. Since we want the coordinates to be recognized as geographic location data, we use the GeoPandas package to convert the pandas data frame into a geo data frame (gdf). Since a geo data frame requires shapely geometry objects, we pass the columns containing the Easting and Northing values, i.e. Rechtswert_(Gauss-Krüger) and Hochwert_(Gauss-Krüger), into the function points_from_xy to transform them into shapely points.
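
A minimal sketch of this conversion, assuming the cleaned data frame is named stations_df (as in the sketch above) and the coordinate columns are named as in Figure 4:

  import geopandas as gpd

  gdf = gpd.GeoDataFrame(
      stations_df,
      geometry=gpd.points_from_xy(
          stations_df["Rechtswert_(Gauss-Krüger)"],  # Easting
          stations_df["Hochwert_(Gauss-Krüger)"],    # Northing
      ),
      crs="EPSG:31466",  # DHDN / 3-degree Gauss-Krüger zone 2
  )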

The figure below shows an example of what the geo data frame gdf looks like.

Figure 5: Geo data frame containing geometry column as shapely points

Storing the Master Data of the Water Stations in the PostgreSQL Database

We create a database env_db and, using the superuser env_master, a new schema named 'eglv' under it. Under this schema, we create a table 'eglv_stations' and upload the geo data frame to it. The connection to the PostGIS database from Python is established by creating a connection engine with the SQLAlchemy package, which we pass to the to_postgis method. With chunksize=100, 100 rows are written to the database at a time. This is shown in Figures 6 and 7. Since the data frame contains only 131 rows, the chunksize plays no significant role here, but it matters for databases with many more rows.

Figure 6: Creating schema evgl with super user env_master

Figure 7: ‘eglv_stations’ table created under schema eglv shown in PgAdmin 4
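
Since Figures 6 and 7 are screenshots, the upload can be sketched as follows, using the env_master credentials mentioned in this report:

  from sqlalchemy import create_engine

  engine = create_engine("postgresql://env_master:M123xyz@localhost:5432/env_db")

  # write the geo data frame to table 'eglv_stations' in schema 'eglv',
  # 100 rows at a time
  gdf.to_postgis("eglv_stations", engine, schema="eglv",
                 if_exists="replace", chunksize=100)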

Next, we use a select query to query the table ‘eglv_stations’ to get all the rows and check if all the data has been uploaded correctly.

Figure 8: Selecting all the rows from table ‘eglv_stations’.
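
The same check can also be run from Python, assuming the geometry column is named 'geometry':

  import geopandas as gpd

  check = gpd.read_postgis("SELECT * FROM eglv.eglv_stations;",
                           engine, geom_col="geometry")
  print(len(check))  # expected: 131 rows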

Plotting the Coordinates in QGIS

In QGIS we select EPSG:31466 as the projected Coordinate Reference System (CRS), which is DHDN / 3-degree Gauss-Krüger zone 2 and corresponds to the coordinate system used by the Emscher Genossenschaft Lippe Verband. We first add the PostGIS layer and connect it to our database. After successfully connecting to the database with the superuser credentials, the eglv schema and the eglv_stations table are available, as shown in the figure below.

After a successful connection to PostGIS.

As a base layer, we add the WMS layer NW Digitale Topographische Karten DTK100 Farbe, also projected in the EPSG:31466 coordinate system, as shown in the figure below. The inverted triangles indicate the locations of the stations.

The figure below shows the zoomed-out map with all stations marked as dark red dots on the same NRW DTK100 Farbe topographic map, again projected in the EPSG:31466 coordinate system.

Figure 9: The station locations plotted on NRW Topographische Karte Map in EPSG: 31466 CRS

Figure 10 shows a snippet of the locations of a few of the stations at a scale of 1:1000000. Dark red dots are used to mark the stations on the WMS layer.

Figure 10: The station locations plotted on NRW Topographische Karte Map in EPSG: 31466 CRS on scale 1:1000000

When plotting exact points on the map, it is also important to choose a background map similar to the reference map. In Figure 11 below, the first image is the QGIS map selected for plotting the stations and the second image is the map shown on the website. Both maps show the location of the KA Hamm station.

Figure 11: Comparison between KA Hamm Station in QGIS Vs KA Hamm Station in Emscher Genossenschaft Lippe Verband web page.

Figure 12 shows all stations that are listed on the Emscher Genossenschaft Lippe Verband web page with coordinate data, marked with custom dark blue location markers.

Figure 12: All stations which are listed on the Emscher Genossenschaft Lippe Verband web page marked with a custom symbol.

Periodic Web Scraping of 'Aktuelle Pegelstände für Emscher und Lippe'

The objective of this part of the project was to periodically scrape water level and discharge data for the Emscher and Lippe rivers from the website https://howis.eglv.de/pegel/html/uebersicht_internet.php. The scraped data was then stored in a database and visualized using temporal controllers in QGIS. The purpose of this report is to document the methodology used, the findings, and the conclusions of the project.

Web Scraping

To scrape the water level and discharge data from the website, we used the Python programming language and the Beautiful Soup library. The data was scraped over a period of 21 days, from March 6, 2023, to March 27, 2023, at different times throughout the day. The data was collected for 12 stations: Fusternberg; Lünen; Hamm; Mengede; Dorsten; Recklinghausen, Am Stadthafen; Bahnstraße; Dinslaken, Konrad-Adenauer-Straße; Bottrop, Essener Straße; Bottrop-Süd; Haltern; Gelsenkirchen, Adenauerallee.

The code we have used to do the web scraping is shown in the figure below:

Figure 13. Web Scraping code

The code uses Python to perform web scraping of water level and discharge data from the website https://howis.eglv.de/pegel/html/uebersicht_internet.php. The purpose is to periodically collect data over a range of dates, and store it in a Pandas dataframe for visualization purposes.

The code uses the requests library to send HTTP GET requests to the website and receive responses. Then, it uses the BeautifulSoup library to parse the HTML content of the response and extract relevant information. Specifically, it extracts the location, water level, discharge, date, and time for each station using a for loop that iterates over each div tag with the class 'tooltip'.

The extracted information is then appended to a list called 'tooltip_info'. Next, the code creates a new Pandas dataframe from the list 'tooltip_info' and stores it in the variable 'new_df'. Finally, the code concatenates the new dataframe 'new_df' with an existing dataframe 'df', and updates 'df' with the concatenated result. This process is repeated every 5 minutes using the time.sleep() function.
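
Since Figure 13 shows the code only as an image, the sketch below reproduces the workflow described above; the field order inside each tooltip is an assumption.

  import time
  import requests
  import pandas as pd
  from bs4 import BeautifulSoup

  URL = "https://howis.eglv.de/pegel/html/uebersicht_internet.php"
  columns = ["location", "Water_level", "Discharge", "date", "time"]
  df = pd.DataFrame(columns=columns)

  while True:
      soup = BeautifulSoup(requests.get(URL).text, "html.parser")

      tooltip_info = []
      for tooltip in soup.find_all("div", class_="tooltip"):
          # hypothetical field order: location, water level, discharge, date, time
          fields = tooltip.get_text("|", strip=True).split("|")
          if len(fields) >= 5:
              tooltip_info.append(fields[:5])

      new_df = pd.DataFrame(tooltip_info, columns=columns)
      df = pd.concat([df, new_df], ignore_index=True)

      time.sleep(300)  # repeat every 5 minutes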

The resulting dataframe contains the water level and discharge data for the 12 stations over the range of dates from March 6th to March 27th, 2023, which can be used for visualization in QGIS using the temporal controller.

After scraping the data from the website and removing duplicates, we needed to create a unique identifier for each station and combine the date and time columns into a single column with datetime format.

Figure 14. Adding SID and timestamp

To achieve this, we first made a copy of the cleaned DataFrame df1 using the copy() method, as shown in Figure 14, and assigned it to a new variable df_copy. Then, we created a new column named SID in df_copy, which groups the data based on the location column and assigns a unique integer value to each group using the groupby() and ngroup() methods. This allows us to uniquely identify each station in the data.

Next, we combined the date and time columns in df_copy into a new column named timestamp. This was done by concatenating the two columns with a space in between using the + operator.

Finally, we converted the timestamp column into a datetime format using the pd.to_datetime() method and provided the format string '%d.%m.%Y %H:%M' which specifies the format of the date and time strings in the column.

The resulting modified DataFrame df_copy contains the original columns as well as the newly created SID and timestamp columns which will be used for further analysis and visualization.
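
A short sketch of these steps (Figure 14 shows the original code); df1 is the cleaned DataFrame from the previous step:

  df_copy = df1.copy()

  # unique integer ID per station, derived from the location column
  df_copy["SID"] = df_copy.groupby("location").ngroup()

  # combine date and time into one column and parse it as datetime
  df_copy["timestamp"] = pd.to_datetime(df_copy["date"] + " " + df_copy["time"],
                                        format="%d.%m.%Y %H:%M")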

After eliminating duplicates and NaN values, we performed data cleaning by creating two new dataframes named 'new_df2' and 'new_df3'. 'new_df2' contains only the 'timestamp', 'SID', and 'Water_level' columns from the previous data frame named 'new_df1'. A new column named 'VAL' is added to 'new_df2' with 'cm' as its value for all rows. The 'Water_level' column is renamed as 'PARAM', which stands for PARAMETER. This is shown in Figure 15.

Similarly, 'new_df3' contains only the 'timestamp', 'SID', and 'Discharge' columns from the previous data frame named 'new_df1'.

Figure 15. Creating specific dataframe for Water Level and Discharge

Then the two cleaned DataFrames, 'new_df2' and 'new_df3', were concatenated into a single DataFrame 'joined_df', which contained the 'timestamp', 'SID', 'PARAM', and 'VAL' columns from both the original DataFrames. This was achieved using the 'pd.concat()' function, with the two DataFrames and the axis set to 0.

Figure 16. Creating a new DataFrame with both parameters
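
A sketch of this reshaping; the unit written into 'VAL' for the discharge rows is an assumption, since the report only states 'cm' for the water level:

  new_df2 = new_df1[["timestamp", "SID", "Water_level"]].copy()
  new_df2["VAL"] = "cm"
  new_df2 = new_df2.rename(columns={"Water_level": "PARAM"})

  new_df3 = new_df1[["timestamp", "SID", "Discharge"]].copy()
  new_df3["VAL"] = "m3/s"  # assumed unit for discharge
  new_df3 = new_df3.rename(columns={"Discharge": "PARAM"})

  # stack the two parameter tables vertically
  joined_df = pd.concat([new_df2, new_df3], axis=0, ignore_index=True)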

To create a DataFrame containing the station ID, location name, and coordinates for each location, we performed several data manipulation steps. First, we read a CSV file containing the location data and stored it in a new DataFrame. Then, we created a new column in another DataFrame containing the first word of the 'location' column. Next, we created a new DataFrame containing only selected columns and removed any duplicate rows based on these columns. We converted the 'location_num' column of the new DataFrame from object to integer data type, and merged the new DataFrame with the location data DataFrame based on the matching 'location_num' and 'Pegelnummer' columns. We set the coordinates to (2571918.768, 5710313.487) for location_num = 10119, which were missing in the merged DataFrame. Finally, we dropped the 'location_num' column from the merged DataFrame. The result is shown in Figure 17.

Figure 17. Locations DataFrame
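
A sketch of these manipulation steps; the CSV file name and the coordinate column names are hypothetical:

  # master data of the stations (hypothetical file name)
  loc_data = pd.read_csv("eglv_stations.csv")

  # the first word of the location string is the station number on the website
  df_copy["location_num"] = df_copy["location"].str.split().str[0]

  locations = df_copy[["location_num", "location", "SID"]].drop_duplicates()
  locations["location_num"] = locations["location_num"].astype(int)

  # join against the master data on the station number; keep unmatched rows
  merged = locations.merge(loc_data, how="left",
                           left_on="location_num", right_on="Pegelnummer")

  # fill in the coordinates missing for station 10119 (values from the report)
  mask = merged["location_num"] == 10119
  merged.loc[mask, ["Rechtswert", "Hochwert"]] = (2571918.768, 5710313.487)

  merged = merged.drop(columns=["location_num"])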

Data Storage

The scraped data was stored in a PostgreSQL database. Once we established the connection with the database, we stored two dataframes, eglv_param and eglv_locations, in the database. These dataframes were obtained from Figure 16 and Figure 17, respectively.

After getting the data from the website https://howis.eglv.de/pegel/html/uebersicht_internet.php, we stored the information in two tables:

1 - Station’s Location in eglv_locations table

2 - Station’s Parameters (Water Level and Discharge) in eglv_param table

Now we will create 2 tables, one for each parameter.

1 - For Discharge: stations_q_values

2 - For Water Level: stations_w_values

We will add each of these tables as a layer to our project in QGIS. Here we demonstrate how a temporal controller is created for one parameter; the procedure is the same for the other.
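
A sketch of this split, assuming the water level rows can be identified by VAL == 'cm' as described above, and reusing the SQLAlchemy engine from the master data section:

  stations_w_values = joined_df[joined_df["VAL"] == "cm"]   # water level
  stations_q_values = joined_df[joined_df["VAL"] != "cm"]   # discharge

  stations_w_values.to_sql("stations_w_values", engine,
                           if_exists="replace", index=False)
  stations_q_values.to_sql("stations_q_values", engine,
                           if_exists="replace", index=False)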

Data Visualization

The data was visualized using QGIS, an open-source geographic information system. The temporal controller feature in QGIS was used to create an animation that showed the changes in water level and discharge over time. The animation was exported as a video file and included in this report.

After adding the layer to the project, we define the Symbology in Layer Properties:

Type - Graduated

Value - PARAM

Classes - 20

After that, we define the Temporal Controller settings as follows:

Configuration - Single Field with Date/Time

Field - timestamp

After applying those properties, the view will be as below:

We have also added a text label that shows the date and time as the temporal controller moves. This can be added via View → Decorations → Title Label, with the expression shown below:

It is also possible to add the legend via View → Decorations → Images. The temporal controller toolbar provides a save option:

Now we save the map animation frames into the folder Geo and then use these .png images to create a video.

Folder geo looks like this:

Then we can use an external tool to create the video from these images. We used https://clideo.com/image-sequence-to-video; the results are shown below:

For water level

For Discharge

Findings

Based on the data we collected, the stations on the Emscher showed more changes in water level than those on the Lippe. In terms of discharge, we only had data available for a limited number of stations, including 20012 KA Hamm, 20004 Dorsten, 10099 Dinslaken Konrad-Adenauer-Straße, and 10103 Bahnstraße. However, even these stations showed variations in discharge over time.

Saving the Pegel Stations' Images in PostgreSQL

We can store images in PostgreSQL by using the bytea data type, which stores binary data, including images. To store an image in PostgreSQL, we need to convert the image to a binary format, such as a byte array, and then insert the binary data into a bytea column in our table. We used the code below to save the images in the database:

We first import the necessary libraries (requests, BeautifulSoup, os, io, PIL, and psycopg2), store the path of the proj folder in the proj_lib_path variable, and add it to the PROJ_LIB environment variable via the os.environ dictionary. This is necessary for correctly reading the geographic coordinates in the data.

We then connect to the env_db database as the user env_master with the password M123xyz, hosted on localhost at port 5432. Next, we create a new table called images in the env_db database with three columns: id (an auto-incrementing serial primary key), pid_val (an integer from 1 to 100 identifying the water level gauge), and image (a binary column that stores the image data for each gauge). We then loop through the values of pid_val from 1 to 100, extract the image data for each corresponding water level gauge from the website, and save it to the images table in the PostgreSQL database. Finally, we commit the changes and close the connection.
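
A sketch of this procedure; how the image URL is located on the Stammdaten page is an assumption, since the original code is shown only as an image:

  import requests
  import psycopg2
  from bs4 import BeautifulSoup

  conn = psycopg2.connect(dbname="env_db", user="env_master",
                          password="M123xyz", host="localhost", port=5432)
  cur = conn.cursor()

  cur.execute("""
      CREATE TABLE IF NOT EXISTS images (
          id SERIAL PRIMARY KEY,
          pid_val INTEGER,
          image BYTEA
      )
  """)

  base = "https://howis.eglv.de/pegel/html/stammdaten_html/MO_StammdatenPegel.php?PIDVal="
  for pid_val in range(1, 101):
      soup = BeautifulSoup(requests.get(base + str(pid_val)).text, "html.parser")
      img = soup.find("img")  # hypothetical: first image on the page
      if img is None:
          continue
      img_data = requests.get(requests.compat.urljoin(base, img["src"])).content
      cur.execute("INSERT INTO images (pid_val, image) VALUES (%s, %s)",
                  (pid_val, psycopg2.Binary(img_data)))

  conn.commit()
  cur.close()
  conn.close()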

In the database we will have this table:

If we add this layer to the QGIS project, the attribute table will look like this:

Georeferencing 5 Stations from the Scraped Data

1. Krudenburg: https://howis.eglv.de/pegel/images/stationpics/maps/20002_stadtplan.gif

a. OSM as base layer

b. DTK farbe as base layer (NRW Topographische Map)

2. KA Hamm: https://howis.eglv.de/pegel/images/stationpics/maps/20012_stadtplan.gif

a. OSM as base layer

b. DTK farbe as base layer (NRW Topographische Map)

3. 20018 HRB Rapphofs Mühlenbach, Zulauf: https://howis.eglv.de/pegel/images/stationpics/maps/20018_stadtplan.gif

a. OSM as base layer

b. DTK farbe as base layer (NRW Topographische Map)

4. OB-Königstraße: https://howis.eglv.de/pegel/html/stammdaten_html/MO_StammdatenPegel.php?PIDVal=18

a. OSM as base layer

b. DTK farbe as base layer (NRW Topographische Map)

5. Bottrop Süd

https://howis.eglv.de/pegel/images/stationpics/maps/10008_stadtplan.gif

a. OSM as base layer

b. DTK farbe as base layer (NRW Topographische Map)

Conclusion

In conclusion, the web scraping and data visualization project succeeded in collecting and visualizing water level and discharge data for the Emscher and Lippe rivers. The project demonstrated the usefulness of web scraping for collecting data for analysis and the effectiveness of QGIS for visualizing temporal data. Further analysis could investigate the factors that contribute to the variation in water level and discharge over time. Storing the data in a database, as we did in this project, makes it more efficient to access and analyze. Additionally, historical data on water level and discharge can be useful for building prediction models that help mitigate the impact of floods. By understanding the patterns and changes in water level and discharge, we can better prepare for and respond to potential floods, saving lives and reducing damage to property and infrastructure.

