Using external data¶
It’s not always practical or desirable to store all the data required by a model in the JSON document. This is particularly true when working with large amounts of timeseries data, but is also applicable to data better stored in a tabular format.
Many parameter types support loading data from an external file using the "url"
keyword. The data is loaded during the setup/reset process, before the model is run.
Supported formats¶
Support for the following data formats is available via the Pandas module.
Comma separated values (.csv)
Excel (.xls, .xlsx)
HDF5 (.h5)
Warning: When reading Excel documents formulae are supported but not re-evaluated and links are not updated; the value read is the value present when the document was last saved.
When working with large amounts of timeseries data the HDF5 format is recommended as it has superior read speeds. Data in an HDF5 file should be in the “fixed” format; this is achieved by passing the format="fixed"
argument to DataFrame.to_hdf
. Where data access speed is critical, users are advised to look at the TablesArrayParameter
parameter instead which supports very fast access via pytables directly (rather than indirectly via pandas).
External data is read using the appropriate pandas.read_xxx
function determined by the file extension (e.g. pandas.read_excel
for xls/xlsx). Keywords that are not recognised by Pywr are passed on to these functions. For example, when reading timeseries data from a CSV you can parse the date strings into pandas timestamps by passing parse_dates=True
(see example below).
Examples¶
Timeseries¶
The most common kind of data to store in an external file is timeseries data. For example, the flow used by a Catchment
node.
An example dataset is given below with three columns: a timestamp (used as the index), rainfall and flow.
Timestamp |
Rainfall |
Flow |
---|---|---|
1910-01-01 |
0.0 |
23.920 |
1910-01-02 |
1.8 |
22.140 |
1910-01-03 |
5.2 |
22.570 |
The parameter below references the timeseries1.csv file in its "url"
. The index column is defined by "index_col"
and the data column is defined by the "column"
keyword (in this case, "Flow"
). The "parse_dates"
keyword is required in order to parse the dates from strings in the CSV file into pandas date objects.
"flow": {
"type": "dataframe",
"url": "timeseries1.csv",
"index_col": "Timestamp"
"parse_dates": true,
"column": "Flow"
}
If the index column of the dataframe is a timestamp the parameter will support automatic resampling, if required. For example, if the external data is on a daily timestep the model can still be run on a weekly timestep. In this case the data mean flow for the week is used. Only subsampling is currently supported (e.g. you can go from daily to monthly, but not monthly to daily).
Constants¶
Constant parameters can also load their data from an external source. This is useful when data with a common theme is stored in a table. For example, the demand for different nodes in the model.
An example is given below with the population and demand for three different cities (not real data!).
City |
Population |
Demand |
---|---|---|
Oxford |
30,294 |
20.3 |
Cambridge |
28,403 |
19.4 |
London |
790,930 |
520.9 |
As in the previous example, the filename is passed to the "url"
keyword. The "index_col"
keyword defines which column should be used for the lookups, with the "index"
keyword specifying the lookup key.
"max_flow_oxford": {
"type": "constant",
"url": "demands.csv",
"index_col": "City"
"index": "Oxford"
"column": "Demand"
}
Monthly profiles¶
Monthly profiles can also be loaded from an external data source. Instead of passing a "column"
keyword, the parameter expects the data source to have 12 columns (plus 1 for the index). The names of the columns are not important.
City |
Jan |
Feb |
Mar |
… |
Dec |
---|---|---|---|---|---|
Oxford |
23.43 |
25.32 |
24.24 |
… |
21.24 |
Cambridge |
11.23 |
14.34 |
13.23 |
… |
12.23 |
"max_flow": {
"type": "monthlyprofile",
"url": "demands_monthly.csv",
"index_col": "City",
"index": "Oxford"
}
Multi-index¶
Multi-indexing of dataframes is supported by passing a list to the "index_col"
keyword. Both numeric and string indexes are valid.
level |
node |
max_flow |
cost |
---|---|---|---|
0 |
demand1 |
10 |
-10 |
0 |
demand2 |
20 |
-20 |
1 |
demand1 |
100 |
-100 |
1 |
demand2 |
200 |
-200 |
{
"name": "DC1",
"type": "output",
"max_flow": {
"type": "constant",
"url": "multiindex_data.csv",
"column": "max_flow",
"index": [0, "demand1"],
"index_col": ["level", "node"]
},
"cost": {
"type": "constant",
"url": "multiindex_data.csv",
"column": "cost",
"index": [1, "demand1"],
"index_col": ["level", "node"]
}
}
In the example above, max_flow evaluates to 10 and cost evaluates to -100.
Tables¶
Each time an external data source is referenced using the "url"
keyword the data is reloaded from disk. If a dataset is going to be used multiple times in a model it can be defined in the "tables"
section of the JSON document. In this way the data will only be loaded once. Parameters can then reference the data using the "table"
keyword instead of the "url"
keyword. Although the index column applied to the data must be defined in the "tables"
section, the index used for each lookup can be different.
An example is given below using the demands.csv dataset shown previously. Two constant parameters are defined referencing data in the table.
"parameters": {
"oxford_demand": {
"type": "constant",
"table": "simple_data",
"column": "Demand",
"index": "Oxford"
},
"cambridge_demand": {
"type": "constant",
"table": "simple_data",
"column": "Demand",
"index": "Cambridge"
}
},
"tables": {
"simple_data": {
"url": "demands.csv",
"index_col": "City"
}
}