A primary facet of continuing my education through graduate school is honing my ability to communicate effectively with people of other disciplines. Iowa City has done a good job of facilitating this; people here seem more receptive to general discourse. Being an instructor at the university has also exposed me to interactions with people of different majors and has given me insight into how someone from a particular field of study tends to approach a problem.

Ultimately, a computer scientist is tasked with building a logical model to solve some real-world problem. The study of the algorithm is a means to generalize problem solving; it is a means of abstracting away from any single domain. Having a firm grasp of this should allow one to step into any domain and solve the problems specific to it. They may not be as up to speed as someone who has been immersed in that domain, but it does not take long for a good computer scientist to get there.

When I first started my studies, one of my classes during the first semester at community college was a database design course. This was necessary for understanding how a website might store information. I learned a structured query language before any markup language and before any scripting language.

I’ve learned a lot since then. During the fall semester of 2022 I decided to circle back to the study of databases, this time in a new domain - geography. I took the upper-level Introduction to Geographic Databases course to get a good feel for the functionality of a spatially enabled database management system such as PostgreSQL with its PostGIS extension.

The following is a report on a project started within this course. Work is ongoing; it is incomplete. I believe these efforts are worthwhile beyond the coursework, and the section describing the motivation for the project should make this apparent. The hope is that the general idea can be integrated into any applicable service, if not implemented as a service others may use.

Database Extension Gauging Cycling Safety


Bicycling safety is a question often avoided, in part because the wider culture is subservient to the automobile. The cyclist is an annoyance to the car, and the car is an annoyance to the cyclist. As a result, firm opinions take root over who is to blame for these inconveniences. Providing a safe environment for cycling is an inconvenience to anyone driving a car (and to the taxpayer who doesn't cycle), so change rarely occurs. The best option a cyclist often has is to try to avoid traffic.

There exists a set of resources a cyclist can use to gauge safety. Google Maps may highlight commutable trails that are exclusive from automobile access, though these maps may not be complete. There is also a community mapping resource where cyclists share frequently traveled cycling routes and place map pins to report cycling incidents: bicycle collisions, near misses, hazards, and thefts. Unfortunately, the data set for these incidents and routes depends solely on explicit contribution, and thus doesn't give a complete gauge of safety. Iowa, for example, has only two incidents noted and pinned; five more could be added from my personal experience alone.

This motivates the question of what other (more complete) data sources can be used to help gauge overall cycling safety. The Department of Transportation provides two interesting data sets that may help an individual gauge this:

  • Authoritative Annual Average Daily Traffic (AADT) volume is a spatial data set in which traffic volume is measured for a wide range of roadways; each roadway has this measure associated with its geographic description. This gives a cyclist a relative idea of whether a road is busy, and thus whether the road itself, or some other route or path that crosses it, should be avoided. It can also indicate whether their cycling routes run in worrisome proximity to these roadways.
  • Iowa Crash Vehicle Data (SOR) is a spatial data set in which traffic crashes and their geographic coordinates are reported. These coordinates can be used to further gauge overall safety - a stretch of road with more reported crashes is less safe than one with fewer.

Adding these two sources of information to such a platform would allow a user to determine which cycling routes are more or less safe. It seems that the platform does not currently let a user view their individual Strava cycling activities in this context: one cannot sign in, view one's Strava activity routes, and query where those routes approach or cross potentially dangerous roadways.

Thus, infrastructure needs to be added to allow a user to sign on, authenticate, and associate their Strava activities with the routes displayed in such a system. This motivates the creation of a database for this project - to house user information that can be associated with activity information, which in turn is geometrically associated with traffic volume (AADT) and crash vehicle data (SOR).

The following is a UML diagram describing such a database:

A UML diagram describing the database discussed in this report.
Preliminary UML diagram


As mentioned above, the data sources are AADT, SOR, and Strava activities. It is indeed possible to gather these collections at the geographic scope of the entire United States, but that is not feasible for this proof-of-concept project. Instead, the scope is restricted to the state of Iowa. The data sources used are:

  • Iowa Authoritative Annual Average Daily Traffic (AADT) volume
    • Data Source: DOT Traffic Information
    • Data attributes and types are given in the link above. The important/relevant attributes to be maintained are:
      • AADT: A quantity that represents the annual average daily traffic count.
      • ROUTE_ID: an identifier for the geometry associated with the line.
  • Iowa Crash Vehicle Data (SOR)

It is likely that the community mapping service mentioned earlier uses Strava Metro to gather anonymized cycling routes for its activity reporting. Such information would be useful for generating estimated volume for a given cycling segment. Using this data for this project faces two obstacles:

  • Using the Strava Metro API requires authorization, which involves a lengthy application process.
  • The data is anonymized, which poses a problem for a user interested in taking a closer look at their own Strava activities.

For the scope of this project, activity data is gathered from a personal set of Strava activities via the public developer API.

Data Prep

Using the Strava API

Strava's API allows a developer to access the activities of any user who grants the asking application permission. The process entails registering an application within Strava's system; each application has a unique client ID. The application should direct a user to the following URL:

https://www.strava.com/oauth/authorize?client_id=<CLIENT_ID>&response_type=code&redirect_uri=<APPLICATION_LOCATION>/exchange_token&approval_prompt=force&scope=activity:<SCOPE>

The bracketed attributes can be described as:

  • <CLIENT_ID>: The unique client ID previously described.
  • <APPLICATION_LOCATION>: The uniform resource identifier to redirect to once permission is granted or denied. For development, http://localhost can be used.
  • <SCOPE>: The scope of the read; the tier of information to be granted. For this project, activity:read_all was used. The read_all scope allows private activities to be read.
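Putting these pieces together, the authorization URL can be assembled programmatically. The helper below is a hypothetical sketch (the function name and placeholder values are mine; the endpoint and parameters come from the flow described above):

```python
from urllib.parse import urlencode

def build_authorize_url(client_id, redirect_uri, scope="activity:read_all"):
    """Assemble the Strava OAuth authorization URL from its parts."""
    params = {
        "client_id": client_id,
        "response_type": "code",
        "redirect_uri": redirect_uri + "/exchange_token",
        "approval_prompt": "force",
        "scope": scope,
    }
    return "https://www.strava.com/oauth/authorize?" + urlencode(params)

# Example with placeholder values:
print(build_authorize_url("12345", "http://localhost"))
```

Note that urlencode percent-escapes the colon in the scope value; Strava accepts either form.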

Once the redirect occurs, an authorization code is given via a GET query parameter. This can then be used to confirm to Strava that authorization has been given: a POST request to the API with the client ID, client secret (given upon application creation), and authorization code generates a response containing an access token for querying the API. I.e.,

curl -X POST https://www.strava.com/oauth/token \
-F client_id=<CLIENT_ID> \
-F client_secret=<CLIENT_SECRET> \
-F code=<AUTHORIZATION_CODE> \
-F grant_type=authorization_code

This responds with a user access token to be used via:

http GET "https://www.strava.com/api/v3/athlete/activities" "Authorization: Bearer <access token>"

The output can be piped into a JSON file.

The above http GET query returned 100 activities. Say this information is piped into a JSON file called morerides.json. A look through the file exposes many more attributes and values than this database requires; the full set is listed in Strava's API documentation.

This data needs to be cleaned. A script was developed to gather the relevant fields and generate a CSV file, which can then be imported using pgAdmin.

The Python script developed for this project is as follows:

import json

with open('morerides.json', 'r') as rides_file:
    rides_data = json.load(rides_file)

csv_string = "activity_id,athlete_id,route_id,route,type,location_country"
csv_string = csv_string + ",start_date,start_date_local,timezone\n"

for ride in rides_data:
    # skip activities without a recorded route
    if ride['map']['summary_polyline']:
        #gather values:
        activityId = ride['id']
        athleteId = ride['athlete']['id']
        routeId = ride['map']['id']
        route = ride['map']['summary_polyline']
        activityType = ride['type']
        locationCountry = ride['location_country']
        startDate = ride['start_date']
        startDateLocal = ride['start_date_local']
        timezone = ride['timezone']

        #create row string
        line = str(activityId) + "," + str(athleteId) + "," + str(routeId) + ","
        line = line + str(route) + "," + str(activityType) + "," + str(locationCountry)
        line = line + "," + str(startDate) + "," + str(startDateLocal)
        line = line + "," + str(timezone) + "\n"

        #concatenate the row string to the csv_string
        csv_string = csv_string + line

with open('rides.csv', 'w') as rides_file:
    rides_file.write(csv_string)

This information can now be imported into a staging table. With a schema called "cycling" created, the table definition can be described as:

CREATE TABLE cycling.activities_staging (
    activity_id BIGINT,
    athlete_id BIGINT,
    route_id varchar,
    encoded_route text COLLATE pg_catalog."default",
    activity_type char(8),
    location_country varchar,
    start_date timestamp,
    start_date_local timestamp,
    timezone varchar,
    PRIMARY KEY(activity_id)
);

Note the column called "encoded_route". Strava stores its polyline information as an encoded polyline string, a format also used by services built on OpenStreetMap. Strava does not tell developers which geographic coordinate system is used; WGS84 (SRID 4326) was surmised from the fact that OpenStreetMap uses it. The fact that the route arrives as an encoded string is the reason a staging table exists in the first place: it allows the cast to a linestring geometry to happen when the activities table is populated. (Essentially, a transactional failsafe.)
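To illustrate what an encoded polyline actually contains, here is a minimal pure-Python decoder for the format (my own sketch, not the project's code; the sample string is the reference example from the format's documentation):

```python
def decode_polyline(encoded, precision=5):
    """Decode an encoded polyline string into (lat, lon) pairs.

    Each coordinate is stored as a zig-zag-encoded integer delta,
    packed 5 bits at a time into printable ASCII (offset by 63).
    This is what ST_LineFromEncodedPolyline undoes on the database side.
    """
    coords = []
    index = lat = lng = 0
    factor = 10 ** precision
    while index < len(encoded):
        for is_lng in (False, True):
            shift = result = 0
            while True:
                b = ord(encoded[index]) - 63
                index += 1
                result |= (b & 0x1F) << shift
                shift += 5
                if b < 0x20:
                    break
            # zig-zag decode the delta and accumulate onto the running value
            delta = ~(result >> 1) if result & 1 else result >> 1
            if is_lng:
                lng += delta
            else:
                lat += delta
        coords.append((lat / factor, lng / factor))
    return coords

# Reference example from the format's documentation:
print(decode_polyline("_p~iF~ps|U_ulLnnqC_mqNvxq`@"))
```

Precision 5 (five decimal places) is the conventional default, which is also what the Strava summary polylines appear to use.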

The activities table is created via:

CREATE TABLE cycling.activities (
    activity_id BIGINT,
    athlete_id BIGINT,
    route_id varchar,
    encoded_route text COLLATE pg_catalog."default",
    activity_type char(8),
    location_country varchar,
    start_date timestamp,
    start_date_local timestamp,
    timezone varchar,
    user_id BIGINT,
    route geometry(linestring,4326),
    PRIMARY KEY(activity_id),
    CONSTRAINT fk_athlete
        FOREIGN KEY(user_id)
        REFERENCES cycling.athletes(user_id)
);

And the data insertion query is as follows:

INSERT INTO cycling.activities (activity_id, athlete_id, route_id, encoded_route,
    activity_type, location_country, start_date, start_date_local, timezone,
    user_id, route)
SELECT staging.activity_id, staging.athlete_id, staging.route_id,
    staging.encoded_route, staging.activity_type, staging.location_country,
    staging.start_date, staging.start_date_local, staging.timezone,
    u.user_id,
    ST_LineFromEncodedPolyline(staging.encoded_route)
FROM cycling.activities_staging AS staging
JOIN cycling.athletes AS u ON u.strava_athlete_id = staging.athlete_id;

What has not been shown is the creation of the athletes table; it is a trivial matter. What's important is the join in the insertion query, which adheres to the foreign key constraint between the tables involved.

Using data of the Department of Transportation

The transition from shapefile to staging table was straightforward for the DOT sources. Staging tables are still required, though: these data sets contain redundant information, sparse columns (each source carries 40+ attributes), and considerable storage overhead. Trimming this fat helps the project maintain scope and performance.

A screenshot showing the significant differences in sizes between data tables.
Note the contrast of storage between the production tables and their staging counterparts.

The traffic volume table can be described as:

CREATE TABLE cycling.traffic_vol (
    gid INT,
    route_id varchar,
    aadt BIGINT,
    aadt_year SMALLINT,
    effective_start_date date,
    effective_end_date date,
    geom geometry(multilinestring,4326),
    PRIMARY KEY(gid)
);

This contains the information that's required, and no more. There are two obstacles in migrating the relevant data from the staging table to the production table:

  • The spatial reference system differs for this data set: the SRID is 3857 rather than 4326. Conversion here is initially trivial, until one realizes the measurements contain z-values; the data therefore needs to be flattened to a two-dimensional plane.
  • The AADT data contains reports from a wide range of years. One road may have information for 2015, 2016, and 2020, while another has information only for 2021. Only the most recent record for each road should be kept.

The following insertion query addresses these two obstacles:

INSERT INTO cycling.traffic_vol (gid, route_id, aadt, aadt_year,
    effective_start_date, effective_end_date, geom)
SELECT DISTINCT ON (route_id) gid, route_id, aadt, aadt_year, effective_, effectiv_1,
    ST_Force2D(ST_Transform(geom, 4326))
FROM cycling.traffic_vol_staging
WHERE effective_ IS NOT NULL
ORDER BY route_id, effective_ DESC;

(Note that the shapefiles have poorly named attributes; "effective_" is the truncated column name for the effective start date, and "effectiv_1" for the effective end date.)
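The intended "most recent record per route" semantics can be illustrated with a small Python sketch; the sample rows and AADT values here are hypothetical:

```python
from datetime import date

# Hypothetical sample rows mimicking the staging table:
# (route_id, effective start date, aadt)
rows = [
    ("R1", date(2015, 1, 1), 1200),
    ("R1", date(2020, 1, 1), 1500),
    ("R2", date(2021, 1, 1), 300),
]

def latest_per_route(rows):
    """Keep only the most recent record for each route."""
    best = {}
    for route_id, start, aadt in rows:
        if route_id not in best or start > best[route_id][0]:
            best[route_id] = (start, aadt)
    return {route: aadt for route, (start, aadt) in best.items()}

print(latest_per_route(rows))  # one (latest) AADT value per route
```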

Migrating from the SOR staging table to populate the crashes table was much simpler. The ST_Force2D and ST_Transform functions were applied, and the range of years was restricted to crashes after 2017. (This restriction can easily be lifted for a production deployment.)

Visualizing the Data

The data can now be put to use. The following figure shows the grouping of activities gathered via Strava:

A map showing a set of lines that chart cycling routes taken from Strava.
Mapping of Strava activities

The following shows these activities overlaying the traffic network. Keep in mind that each road segment has an associated AADT value:

A map showing the previous Strava mappings overlaying a set of lines describing public roads.
Strava activities overlaying AADT mapping

And then traffic crash reports can be overlaid:

A map showing the previous Strava and public road mappings overlaid by a set of points describing locations of motorized crashes.
Strava activity and AADT mappings overlaid by SOR crash plottings.

Queries gauging proximity safety can now be employed. For example, the following selects crashes within 0.01 degrees of any activity route:

SELECT crash.geom FROM cycling.crashes as crash, cycling.activities as activity
WHERE ST_DWithin(crash.geom,activity.route,0.01);
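A note on the 0.01 threshold above: with SRID 4326 geometries, ST_DWithin measures distance in degrees rather than meters. The sketch below (a spherical-Earth approximation; the latitude value is an assumption for illustration) shows roughly what 0.01 degrees buys on the ground:

```python
import math

def degree_offsets_in_meters(lat_deg, delta_deg):
    """Approximate ground distance of a degree offset at a given latitude.

    Spherical-Earth approximation (radius ~6371 km): good enough to
    interpret a degree-based ST_DWithin threshold, not for surveying.
    """
    earth_radius_m = 6_371_000
    meters_per_deg_lat = math.pi * earth_radius_m / 180
    meters_per_deg_lon = meters_per_deg_lat * math.cos(math.radians(lat_deg))
    return delta_deg * meters_per_deg_lat, delta_deg * meters_per_deg_lon

# Roughly Iowa City's latitude (an assumption for illustration):
ns, ew = degree_offsets_in_meters(42.0, 0.01)
print(f"0.01 degrees is ~{ns:.0f} m north-south and ~{ew:.0f} m east-west")
```

Casting both geometries to geography, or using a projected SRID, would let the threshold be expressed in meters directly.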


One facet of the database has yet to be discussed: the Cycling Segment entity. The intention of this table is to house the individual segments that constitute a multiline. Each segment (a single mono-line as far as the database is concerned; not to be confused with a Strava segment) is to have a quantifier tracking the number of activities that share it. This permits a metric for how many cyclists commute a segment, which can help hone safety measures. (More cyclists traveling a segment may be a positive indicator of safety.) It may also make it easier to isolate which Strava activities cross a motorized roadway, and thus to measure the safety of a particular route.

The first step of this process is to find the individual points that constitute a multiline. ST_DumpPoints is a good start: it produces an index for each point along with its geometry. Consider the following query:

SELECT route_id, (dp).path[1] AS path_index, (dp).geom AS node
FROM (SELECT route_id, ST_DumpPoints(route) AS dp FROM cycling.activities) AS segments;

This query associates each route_id with an indexed set of points. Assign this result to a view called list_points. The task now is to wrap these points up into individual lines, which can be accomplished with the following query:

SELECT lp1.route_id, st_makeline(lp1.node, lp2.node) as line FROM list_points as lp1,
list_points as lp2 WHERE lp2.path_index - lp1.path_index = 1 AND lp1.route_id = lp2.route_id;

What makes the above query work is the logic in the WHERE clause. The condition "lp2.path_index - lp1.path_index = 1" ensures that only contiguous points form a line; basic algebraic manipulation of the condition helps highlight this. The condition "lp1.route_id = lp2.route_id" ensures that the points passed to st_makeline come from the same activity.
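The same pairing logic can be mirrored in plain Python over a toy stand-in for the list_points view (the route IDs and coordinates below are made up):

```python
# Toy stand-in for the list_points view: (route_id, path_index, point)
list_points = [
    ("a", 1, (0.0, 0.0)),
    ("a", 2, (1.0, 0.0)),
    ("a", 3, (1.0, 1.0)),
    ("b", 1, (5.0, 5.0)),
    ("b", 2, (6.0, 5.0)),
]

# Self-join: pair rows whose path indices differ by exactly 1
# and that belong to the same route, forming mono-line segments.
segments = [
    (r1, (p1, p2))
    for (r1, i1, p1) in list_points
    for (r2, i2, p2) in list_points
    if i2 - i1 == 1 and r1 == r2
]
print(segments)
```

Route "a" yields two segments and route "b" one, exactly as the SQL self-join would produce.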

The latest selection query segments each activity into a set of mono-lines. One problem remains, though - overlap is not factored in! If these lines were stored in the segments table as-is, a lot of redundant information would be kept. Consider the following figures:

A map showing the initial Strava activity mappings where the lines are now represented by their vertices.
Individual plotting of vertices for each activity polyline. Scale of 1:20929. Query used was SELECT * FROM list_points

A map showing the vertices of the initial Strava activity mappings with a more precise scale.
Same plotting of vertices for activity polylines. Scale is now 1:5232; Same query as previous figure.

A zoomed in scale of the initial Strava activities exposing many redundant lines.
Conversion of polyline vertex plotting to contiguous monolines. Scale is now 1:654.

This exposes a storage problem: it should not be necessary to store all these lines. The question is what geometric threshold should be used to consider one line segment equivalent to another. Intuitively, this threshold should be based on either geometric scale or distance.
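Purely as an illustration of what a distance-based threshold could mean, the following toy sketch (the coordinates and grid size are made-up assumptions, not a settled design) snaps segment endpoints to a coarse grid and treats segments sharing the same snapped endpoints, in either direction, as duplicates:

```python
# Grid cell size in degrees - a tunable assumption, not a chosen value.
GRID = 0.001

def snap(point, grid=GRID):
    """Snap a (lat, lon) point to the nearest grid intersection."""
    return (round(point[0] / grid) * grid, round(point[1] / grid) * grid)

def canonical(segment, grid=GRID):
    """Direction-insensitive key: sorted pair of snapped endpoints."""
    a, b = snap(segment[0], grid), snap(segment[1], grid)
    return (a, b) if a <= b else (b, a)

# Made-up segments; the second is a near-duplicate of the first.
segments = [
    ((41.6612, -91.5302), (41.6622, -91.5301)),
    ((41.66121, -91.53019), (41.66219, -91.53012)),
    ((41.6700, -91.5400), (41.6710, -91.5399)),
]
unique = {canonical(s) for s in segments}
print(len(unique))  # the near-duplicates collapse to one key
```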

Current status and the future of this project

The problem exposed in the previous section is yet to be solved. It is uncertain whether there exists a geometric function (or set of functions) within PostGIS to quickly solve this issue. Alternatively, a proximity threshold needs to be discussed with someone close to the field of geography.

Once this problem is solved, queries can show which activities include a segment that crosses a motorized roadway and/or runs within a certain distance of one. The number of intersections, and the distance spent in proximity, can be used in tandem with traffic volume measurements and crash counts to present a safety measure for a region within the activity. This data can also be used at a regional scope to give an idea of which regions or territories are safer for cycling.

The scope of this project has been from the perspective of a computer scientist rather than a geographer. The primary motivation is to help a cyclist make informed decisions. Some take the term "help" to mean dictating through the results of research and analysis; my goal has been to develop a tool others may use. This goal is close to being accomplished in terms of providing a solid foundation. Once achieved, the foundation may be built upon as a web application, or proposed to a group such as the maintainers of the community mapping service mentioned earlier.