catalog.data.gov

catalog.data.gov, a.k.a. Catalog, is a CKAN app containing an index (or catalog) of many federal, state, and municipal datasets. This is the main app of Data.gov and is generally what folks are thinking of when they refer to Data.gov.

Environments

Instance     URL
Production   catalog.data.gov
Staging      catalog-datagov.dev-ocsit.bsp.gsa.gov
Sandbox      catalog.sandbox.datagov.us

Dependencies

Sub-components:

  • ckan

Services:

  • apache2
  • rds
  • redis
  • solr

Logs

Use dsh to view the logs.

Web instances:

  • /var/log/ckan/ckan.access.log
  • /var/log/ckan/ckan.error.log

Worker (harvest) instances:

  • /var/log/fetch-consumer.log
  • /var/log/gather-consumer.log
  • /var/log/harvester_run.log
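
For example, to tail the CKAN error log across the web instances, something like the following should work (a sketch; the catalog-web dsh group name is an assumption and should match your dsh configuration):

$ dsh -g catalog-web -M -c -- sudo tail -n 50 /var/log/ckan/ckan.error.log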

Jobs

All available jobs are listed on the CKAN commands page.

Common tasks

Harvest source stats

ckan-php-manager performs several tasks, including generating a report on harvest sources. See README for full instructions.

$ php cli/harvest_stats_csv.php

Columns include:

  • title
  • name
  • url
  • created
  • source_type
  • org title
  • org name
  • last_job_started
  • last_job_finished
  • total_datasets

Update datasets groups and category tags

ckan-php-manager's tagging CLI takes input from a CSV file with headers dataset, group, and categories, and assigns groups and category tags to datasets or removes them from datasets.

$ php cli/tagging/assign_groups_and_tags.php
$ php cli/tagging/remove_groups_and_tags.php
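
A minimal sketch of the input file; the headers come from the CLI described above, but the row values here are hypothetical:

dataset,group,categories
some-dataset-name,climate,Climate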

Harvester commands

All harvester commands should be run from one of the harvesters, usually catalog-harvester1p.

Harvest run

The harvest run command runs every few minutes to manage pending and in-progress harvest jobs. It will (not necessarily in this order):

  • Queue jobs that have been scheduled
  • Start jobs that have been queued
  • Clean up jobs that have completed or errored
  • Email job results to points of contact

Run the job through supervisor.

$ sudo supervisorctl start harvest-run

The job is logged to /var/log/ckan/harvest-run.log.
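
To verify the job's state, supervisor's standard status subcommand and a tail of that log both work:

$ sudo supervisorctl status harvest-run
$ tail -f /var/log/ckan/harvest-run.log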

Remove duplicate datasets

Data.json harvest sources have a known issue that creates duplicate datasets: the first-harvested copy of a duplicate has no harvest_object link in the UI or in the DB, and it needs to be removed upon the user's request.

This SQL query can be used to get an overall picture of how many duplicate datasets exist in each organization.

SELECT g.name, COUNT(*) FROM package p
LEFT JOIN harvest_object h
ON p.id = h.package_id
JOIN "group" g 
ON p.owner_org = g.id
WHERE p.state = 'active' and p.type = 'dataset'
AND h.package_id IS NULL
GROUP BY g.name
ORDER BY 2 DESC;
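
To list the duplicates for a single organization rather than just the counts, the same join can be filtered by group name. A sketch run through psql; the connection placeholders and {ORGANIZATION-NAME} are values to fill in, and metadata_created is the standard CKAN package timestamp column:

$ psql -h {DB_HOST} -U {DB_USER} {DB_NAME} -c "
    SELECT p.name, p.metadata_created
    FROM package p
    LEFT JOIN harvest_object h ON p.id = h.package_id
    JOIN \"group\" g ON p.owner_org = g.id
    WHERE p.state = 'active' AND p.type = 'dataset'
      AND h.package_id IS NULL
      AND g.name = '{ORGANIZATION-NAME}'
    ORDER BY p.metadata_created;"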

Use datagov-dedupe to remove the duplicates. Install it on your local machine, then run the commands below to remove the older duplicates and keep the newest one:

Do a dry run:

pipenv run python duplicates-identifier-api.py {ORGANIZATION-NAME} --api-key {YOUR-API-KEY} --newest

After the dry run, compare the count from the generated *.csv file with the SQL query result; they should match for the organization.

Make the real change:

pipenv run python duplicates-identifier-api.py {ORGANIZATION-NAME} --api-key {YOUR-API-KEY} --commit --newest -v

When done, upload the generated *.csv file to the Google Drive dedupe folder for record keeping.

The script runs very slowly: on the current production catalog server, it takes about 50 seconds to process one record. For any task with more than a few dozen records to process, it is recommended to add some auto-retry logic, such as

while ! command-copied-from-above; do sleep 60; done
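
Concretely, with the commit command from above:

$ while ! pipenv run python duplicates-identifier-api.py {ORGANIZATION-NAME} --api-key {YOUR-API-KEY} --commit --newest -v; do sleep 60; done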

Alert conditions

Common alerts we see for catalog.data.gov.

Rapid consumption of memory

This condition usually manifests as a New Relic Host Unavailable alarm: the apache2 services (CKAN) consume more and more memory in a short amount of time until they eventually lock up and become unresponsive. It seems to affect multiple hosts at the same time.
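
To confirm which hosts are affected before intervening, memory usage can be checked across the group from the jumpbox (a sketch assuming the same catalog-web-v1 inventory group used in the resolution steps below):

$ ansible -m shell -a 'free -m' catalog-web-v1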

Resolution

  • From the jumpbox, reload apache2 using Ansible across the web hosts

    $ ansible -m service -a 'name=apache2 state=reloaded' -f 1 catalog-web-v1
    
  • For any individual failed hosts, use retry_ssh.sh (source) to repeatedly retry the apache2 restart on the host. Run this in a tmux session to prevent disconnects.

    $ ./retry_ssh.sh $host sudo service apache2 restart
    
  • Because the OOM killer might have killed some services in order to recover, reboot hosts as necessary.

    $ ansible-playbook actions/reboot.yml --limit catalog-web-v1 -e '{"force_reboot": true}'
    

CDN and Health Checks

The Netscaler configuration verifies that sites are working and directs traffic only to working machines. To check that a server is responding appropriately, the Netscaler sends a HEAD request to https://{host_name}/api/action/package_search?rows=1 and expects a 200 response code. See the latest health check configuration.
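
The same check can be reproduced manually with curl, where -I sends a HEAD request (using the production hostname from the Environments table above):

$ curl -I 'https://catalog.data.gov/api/action/package_search?rows=1'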

A cache is created via CloudFront for the entire website to ensure traffic is served quickly and efficiently. Requests to /api/* are not cached.

DB initialization

We have a number of initializations that are run manually to set up a database. In most cases our system reflects CKAN docs best practices.

You can find the credentials in the datagov-deploy repo under the appropriate environment. The user used by CKAN may need to be created rather than using the master user.

Pycsw

You'll need to create the necessary DBs and users for pycsw in the catalog Postgres DB. These should be defined in pycsw-all.cfg. Then, initialize the DB by activating the venv and running the setup command from the current pycsw directory:

$ . .venv/bin/activate
$ ./bin/pycsw-ckan.py -c setup_db -f /etc/pycsw/pycsw-collection.cfg
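
For the first step, a minimal sketch of creating the user and DB with psql; the pycsw names and connection placeholders are assumptions and should match what is defined in pycsw-all.cfg:

$ psql -h {DB_HOST} -U {MASTER_USER} -c "CREATE USER pycsw WITH PASSWORD '{PASSWORD}';"
$ psql -h {DB_HOST} -U {MASTER_USER} -c "CREATE DATABASE pycsw OWNER pycsw;"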

Sandbox

To simplify recovery from sandbox destruction/recreation (which happens much more often), the database, username, and password created by Terraform are used. You'll need to review datagov-infrastructure-live for the necessary credentials.

These, along with the new DB endpoint, should be added/verified in the sandbox datagov-deploy secrets. This branch or commit should be deployed.

Then, manually run the steps for initializing the database, as laid out here (the Run migrations section).

The server should restart automatically, and the site should come up.
