catalog.data.gov
Also known as Catalog, this is a CKAN app containing an index (or catalog) of many federal, state, and municipal datasets. It is the main app of Data.gov and is generally what people mean when they refer to Data.gov.
Instance | URL |
---|---|
Production | catalog.data.gov |
Staging | catalog-datagov.dev-ocsit.bsp.gsa.gov |
ci | catalog.sandbox.datagov.us |
Sub-components:
- ckan
Services:
- apache2
- rds
- redis
- solr
Use dsh to view the logs.
Web instances:
- /var/log/ckan/ckan.access.log
- /var/log/ckan/ckan.error.log
Worker (harvest) instances:
- /var/log/fetch-consumer.log
- /var/log/gather-consumer.log
- /var/log/harvester_run.log
An incomplete list of jobs (worker tasks) that catalog.data.gov performs.
- ckan-clean-deleted
- ckan-combine-feeds
- ckan-db-solr-sync
- ckan-export-csv
- ckan-harvest-job-cleanup
- ckan-jsonl-export
- ckan-metrics-csv
- ckan-report
- ckan-sitemap
- ckan-tracking-update
- qa-update-sel
Generates a CSV containing all datasets and how many page views they've had. https://filestore.data.gov/gsa/catalog/metrics/metrics-2019-04-30.csv
There is a "master" file on catalog.data.gov. Each month's data has to be appended to the master file.
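The monthly append could be scripted along the following lines. This is only a minimal sketch: the function name `append_monthly_metrics` and both file arguments are hypothetical, and it assumes the monthly CSV and the master file share the same header row and that the master file ends with a newline. The runbook does not describe how the append is actually performed.

```shell
# Append the data rows of a monthly metrics CSV to a master CSV.
# Hypothetical helper: file names and the shared-header assumption
# are illustrative and not taken from the runbook.
append_monthly_metrics() {
    master="$1"
    monthly="$2"

    # Refuse to append if the column headers differ.
    if [ "$(head -n 1 "$master")" != "$(head -n 1 "$monthly")" ]; then
        echo "header mismatch between $master and $monthly" >&2
        return 1
    fi

    # Copy every line after the header onto the end of the master file.
    tail -n +2 "$monthly" >> "$master"
}
```

The header check is there so that a change in the export's columns fails loudly instead of silently corrupting the master file.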
ckan-php-manager performs several tasks, including generating a report on harvest sources. See README for full instructions.
$ php cli/harvest_stats_csv.php
Columns include:
- title
- name
- url
- created
- source_type
- org title
- org name
- last_job_started
- last_job_finished
- total_datasets
ckan-php-manager's tagging CLI takes input from a CSV file with the headers dataset, group, and categories, and assigns groups and category tags to datasets or removes them from datasets.
$ php cli/tagging/assign_groups_and_tags.php
$ php cli/tagging/remove_groups_and_tags.php
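Before running either tagging command, it may help to confirm that the input file has the expected header row. The sketch below assumes the headers appear in the order dataset, group, categories and are comma-separated with no quoting. The function name `check_tagging_csv` is hypothetical and is not part of ckan-php-manager; see that project's README for where the CLI expects the file.

```shell
# Check that a tagging input CSV starts with the expected header row.
# Hypothetical helper; the header order is an assumption.
check_tagging_csv() {
    file="$1"
    expected="dataset,group,categories"

    # Strip a trailing carriage return in case the file was saved with CRLF.
    header=$(head -n 1 "$file" | tr -d '\r')

    if [ "$header" != "$expected" ]; then
        echo "unexpected header: $header" >&2
        return 1
    fi
}
```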
All harvester commands should be run from one of the harvesters, usually catalog-harvester1p.
The harvest run command runs every few minutes to manage pending and in-progress harvest jobs. It will (not necessarily in this order):
- Queue jobs that have been scheduled
- Start jobs that have been queued
- Clean up jobs that have completed or errored
- Email job results to points of contact
Run the job through supervisor.
$ sudo supervisorctl start harvest-run
The job is logged to /var/log/ckan/harvest-run.log.
Common alerts we see for catalog.data.gov.
This usually manifests as a New Relic Host Unavailable alarm. The apache2 services (CKAN) consume more and more memory in a short amount of time until they eventually lock up and become unresponsive. This condition seems to affect multiple hosts at the same time.
- From the jumpbox, reload apache2 using Ansible across the web hosts.
  $ ansible -m service -a 'name=apache2 state=reloaded' -f 1 catalog-web-v1
- For any individual failed hosts, use retry_ssh.sh to repeatedly retry the apache2 restart on the host. Run this in a tmux session to prevent disconnects.
  $ ./retry_ssh.sh $host sudo service apache2 restart
- Because the OOM killer might have killed some services in order to recover, reboot hosts as necessary.
  $ ansible-playbook actions/reboot.yml --limit catalog-web-v1 -e '{"force_reboot": true}'
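The idea of repeatedly retrying a command against an unresponsive host can be sketched as a small retry loop. This is not the contents of retry_ssh.sh, which is not reproduced here. The function name `retry_cmd`, the attempt limit, and the delay are all assumptions chosen for illustration.

```shell
# Retry a command until it succeeds or the attempt limit is reached.
# Usage: retry_cmd MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper; not the actual retry_ssh.sh.
retry_cmd() {
    max_attempts="$1"
    delay="$2"
    shift 2

    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Run the remaining arguments as the command; stop on success.
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
        # Wait before the next attempt, but not after the final failure.
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example (values are placeholders):
#   retry_cmd 20 15 ssh "$host" sudo service apache2 restart
```

A bounded attempt count keeps the loop from running forever inside tmux if the host never recovers, at which point a reboot is the next step.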