Create association between GA and CKAN to group page views by agency
User Story
In order to improve transparency to agency partners and the public, Data.gov needs to set up unique tracking groups for agency partners and their datasets. This will allow Data.gov to publish statistics on agency partners' dataset viewership.
Acceptance Criteria
- [ ] GIVEN current CKAN version and Google Analytics status for Data.gov
THEN Data.gov integrates Google Analytics 4 and CKAN to group datasets into separate analytics groups to determine overall agency viewership
Background
With the completed migration to GA4 and the integration into CKAN, Data.gov can now set up unique tracking groups for each agency.
Sketch
- [ ] Determine how to associate a dataset with a tracking group in CKAN
- [ ] Create an agency tag in GA4
- [ ] Validate the agency data is being populated with the correct tags
I believe we would implement this at the harvest source level so we can associate datasets with each unique ID.
@dlennox24 attempted to do this by adding the values to the datalayer. When testing the solution we found:
- The values were not populating by the time the GA code ran, which meant at pageload, we could not collect the values.
- Those values could be altered by browser translation features, which would scatter the values in reporting.
I ended up working with a colleague to identify a spot where publisher and organization appeared in the DOM, but would not be translated. We thought we found this in the breadcrumbs:
So we created CSS Selector variables to capture that URL and then parse it to separate the query params for publisher and organization.
Unfortunately, I then discovered that some pages do not have a publisher, and in that case, the entire li we were capturing does not appear.
So I rewrote several of the variables to look at the last-child li of the breadcrumb and created an IF/ELSE variable that either gets the query param for organization if it exists, or captures the last page path of the URL if not (which is the org when publisher is missing). The final setup in GTM does the following:
- Capture the URL found in the last-child li of the breadcrumbs
- Look to see if it has a query parameter (?)
- If it does, parse the URL to get the organization param
- If it doesn't, grab the last page path of the URL, which is the org
- Parse the URL for publisher if it's there, otherwise return NO PUBLISHER
- Use a Regex lookup to check whether it's a dataset page by looking for /dataset/; if so, output both organization and publisher, and if not, return NOT DATASET
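For anyone who wants to reason about this logic outside of GTM, here is a minimal Python sketch of the steps above. It is not the actual GTM configuration: the function name `dataset_dimensions`, the example URLs, and the assumption that the query params are literally named `organization` and `publisher` are all illustrative.

```python
from urllib.parse import urlparse, parse_qs

NO_PUBLISHER = "NO PUBLISHER"
NOT_DATASET = "NOT DATASET"


def dataset_dimensions(page_path, breadcrumb_href):
    """Hypothetical helper mirroring the GTM variable logic described above.

    page_path: path of the page being viewed.
    breadcrumb_href: URL taken from the last-child li of the breadcrumbs.
    """
    # Only dataset pages report an organization and publisher.
    if "/dataset/" not in page_path:
        return {"organization": NOT_DATASET, "publisher": NOT_DATASET}

    parsed = urlparse(breadcrumb_href)
    params = parse_qs(parsed.query)

    if parsed.query:
        # Query string present: read the organization param
        # (param name assumed; empty string if it is absent).
        organization = params.get("organization", [""])[0]
    else:
        # No query string: the last path segment is the org.
        organization = parsed.path.rstrip("/").rsplit("/", 1)[-1]

    # Publisher is optional; fall back to a fixed placeholder.
    publisher = params.get("publisher", [NO_PUBLISHER])[0]

    return {"organization": organization, "publisher": publisher}
```

With a made-up breadcrumb link such as `/dataset?organization=example-agency&publisher=Example%20Bureau`, the sketch reads both values from the query string; with `/organization/example-agency` it falls back to the last path segment for the org and reports NO PUBLISHER.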
I added DATAGOV_dataset_organization and DATAGOV_dataset_publisher as custom dimensions in GA and published the GTM container to prod. Testing looked good in debug mode, but the real QA will be tomorrow, when I can check the data in GA.
This is working, but not sufficiently. We are getting 60% of pageviews with org and publisher and 40% appearing as (not set).
Going to create a separate ticket to troubleshoot.
There's a draft PR attached to this ticket that should be addressed. @robert-bryson can you take a look at that and update the status? Thanks.
https://github.com/GSA/ckanext-datagovtheme/pull/193
Moving to in review until the above gets resolved so we don't lose track of it.
As @tdlowden mentioned, the work was complete as designed. However, in practice the data flow was inconsistent. So https://github.com/GSA/data.gov/issues/4743 was created to troubleshoot. That draft PR is kind of an in-between mitigation step after this ticket was done, but before 4743 was created. It will be closed in favor of the work being done by Robby.