catalog.data.gov
Also known as Catalog, this is a CKAN app containing an index (or catalog) of many federal, state, and municipal datasets. It is the main app of Data.gov and is generally what people have in mind when they refer to Data.gov.
Cloud.gov
Instance | URL |
---|---|
Production | catalog-prod-datagov.app.cloud.gov |
Staging | catalog-stage-datagov.app.cloud.gov |
Development | catalog-dev-datagov.app.cloud.gov |
FCS (to be deprecated)
Instance | URL |
---|---|
Production | catalog.data.gov |
Staging | catalog-datagov.dev-ocsit.bsp.gsa.gov |
Sandbox | catalog.sandbox.datagov.us |
Sub-components:
- ckan
Services:
- nginx
- PostgreSQL
- redis
- SOLR
See New Relic and/or cloud.gov for how to review logs.
All available jobs are listed on the CKAN commands page.
ckan-php-manager performs several tasks, including generating a report on harvest sources. See README for full instructions.
$ php cli/harvest_stats_csv.php
Columns include:
- title
- name
- url
- created
- source_type
- org title
- org name
- last_job_started
- last_job_finished
- total_datasets
A similar tool is located at https://github.com/GSA/catalog.data.gov/tree/main/tools/harvest_source_import#create-a-report-on-harvest-sources
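As a hedged illustration of how such a report might be used, the Python sketch below reads a CSV with the columns listed above and returns the sources whose last job started but never recorded a finish time. It assumes the file has a header row with exactly these column names; the sample rows and the function name unfinished_sources are made up for illustration and are not part of ckan-php-manager.

```python
import csv
import io

# Hypothetical sample of the report, using the column names listed above.
# The real file name and date format produced by harvest_stats_csv.php are
# not documented here, so both are assumptions.
SAMPLE_REPORT = """title,name,url,created,source_type,org title,org name,last_job_started,last_job_finished,total_datasets
Example A,example-a,https://example.gov/data.json,2020-01-01,datajson,Org One,org-one,2024-01-02,2024-01-02,120
Example B,example-b,https://example.gov/waf/,2020-02-01,waf,Org Two,org-two,2024-01-03,,0
"""


def unfinished_sources(report_file):
    """Return names of harvest sources whose last job started but has no finish time."""
    reader = csv.DictReader(report_file)
    return [
        row["name"]
        for row in reader
        if row["last_job_started"] and not row["last_job_finished"]
    ]


print(unfinished_sources(io.StringIO(SAMPLE_REPORT)))  # ['example-b']
```

To run it against a real report, pass an opened file instead of the in-memory sample, for example `open(path, newline="")`.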
ckan-php-manager's tagging CLI takes input from a CSV file with the headers dataset, group, and categories, and assigns groups and category tags to datasets, or removes them from datasets.
$ php cli/tagging/assign_groups_and_tags.php
$ php cli/tagging/remove_groups_and_tags.php
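A minimal sketch of preparing such an input file in Python, assuming the three headers are spelled exactly dataset, group, and categories and that each row carries one dataset, one group, and one category value. The sample values and the build_tagging_csv helper are hypothetical; the ckan-php-manager README remains the authority on the value formats the scripts expect.

```python
import csv
import io


def build_tagging_csv(assignments):
    """Render (dataset, group, categories) tuples as CSV text with the expected headers."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["dataset", "group", "categories"])
    writer.writerows(assignments)
    return buffer.getvalue()


# Placeholder values only; real dataset and group identifiers will differ.
print(build_tagging_csv([("example-dataset", "example-group", "Example Category")]))
```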
All harvester commands should be run from either the catalog-gather or catalog-fetch "apps". Each of these cloud.gov apps runs a single job, but you can run extra jobs using the cf-task functionality on these apps.
This app or command is running at all times (other than restarts). This process is not fault tolerant: running multiple of these, and the job stopping in the middle of processing, will result in a lost state and can sometimes cause harvest jobs to be cancelled.
This app or command is running at all times (even during restarts). This job is fault tolerant for the most part, and will pick up where it left off in the queue. However, if it was in the middle of processing a dataset that item may not be handled properly (may be left in an undone state). There are also known issues with harvests creating duplicate datasets.
The harvest run command runs every 15 minutes to manage pending and in-progress harvest jobs. It will (not necessarily in this order):
- Queue jobs that have been scheduled
- Start jobs that have been queued
- Clean up jobs that have completed or errored
- Email job results to points of contact
Run the job through GitHub Actions. See code here and go here to kick off the job manually.
Data.json harvest sources have a known issue that creates duplicate datasets. The first harvested copy of each duplicate has no harvest_object link in the UI or in the DB, and it needs to be removed upon the user's request. [UPDATE: for non-data.json harvest sources, refer to this ticket].
This SQL query can be used to get an overall picture of how many duplicate datasets exist in each organization.
SELECT g.name, COUNT(*) FROM package p
LEFT JOIN harvest_object h
ON p.id = h.package_id
JOIN "group" g
ON p.owner_org = g.id
WHERE p.state = 'active' and p.type = 'dataset'
AND h.package_id IS NULL
GROUP BY g.name
ORDER BY 2 DESC;
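To see why the query isolates datasets that lack a harvest object, the sketch below runs it against a toy in-memory SQLite database. The three tables are drastically simplified stand-ins for the CKAN schema (only the columns the query touches), and every row is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified, hypothetical versions of the CKAN tables used by the query.
conn.executescript("""
CREATE TABLE "group" (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE package (id TEXT PRIMARY KEY, owner_org TEXT, state TEXT, type TEXT);
CREATE TABLE harvest_object (id TEXT PRIMARY KEY, package_id TEXT);
INSERT INTO "group" VALUES ('g1', 'org-one'), ('g2', 'org-two');
INSERT INTO package VALUES
  ('p1', 'g1', 'active', 'dataset'),
  ('p2', 'g1', 'active', 'dataset'),
  ('p3', 'g1', 'active', 'dataset'),
  ('p4', 'g2', 'active', 'dataset'),
  ('p5', 'g2', 'deleted', 'dataset');
INSERT INTO harvest_object VALUES ('h1', 'p1');
""")

# Same query as above: the LEFT JOIN keeps packages without a harvest object,
# and the IS NULL filter then selects exactly those packages.
rows = conn.execute("""
SELECT g.name, COUNT(*) FROM package p
LEFT JOIN harvest_object h ON p.id = h.package_id
JOIN "group" g ON p.owner_org = g.id
WHERE p.state = 'active' AND p.type = 'dataset'
AND h.package_id IS NULL
GROUP BY g.name
ORDER BY 2 DESC;
""").fetchall()

print(rows)  # [('org-one', 2), ('org-two', 1)]
```

Package p1 is excluded because it has a harvest object, and p5 is excluded because it is not active.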
Use datagov-dedupe to remove the duplicates. Install it on your local machine, then run the following commands to remove the oldest dataset and keep the newest one:
Do a dry run:
pipenv run python duplicates-identifier-api.py {ORGANIZATION-NAME} --api-key {YOUR-API-KEY} --newest
After the dry run, compare the count from the generated *.csv file with the SQL query result; they should match for the organization.
Make the real change:
pipenv run python duplicates-identifier-api.py {ORGANIZATION-NAME} --api-key {YOUR-API-KEY} --commit --newest -v
When done, upload the generated *.csv file to the Google Drive dedupe folder for record keeping.
On the current production catalog server, it takes about 3 seconds to process 1 record. For any task with a large number of records to process, it is therefore recommended to run it with some auto-retry logic such as
while ! command-copied-from-above; do sleep 60; done
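The same retry idea can be sketched in Python, together with a rough runtime estimate based on the figure of about 3 seconds per record. This is only an illustration: the function names, the 60-second default wait (mirroring the shell loop), and the injectable run and sleep parameters are assumptions made so the logic can be exercised without launching a real process.

```python
import subprocess
import time


def estimated_hours(record_count, seconds_per_record=3):
    """Rough runtime estimate based on the ~3 seconds per record noted above."""
    return record_count * seconds_per_record / 3600


def run_until_success(command, wait_seconds=60, run=subprocess.run, sleep=time.sleep):
    """Re-run `command` until it exits with status 0; return the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        if run(command).returncode == 0:
            return attempts
        # Non-zero exit: wait before trying again, like `sleep 60` in the shell loop.
        sleep(wait_seconds)


# 10,000 records at about 3 seconds each is a little over 8 hours.
print(round(estimated_hours(10_000), 1))
```

The command is passed as an argument list, for example the dedupe command from above split into its individual words, with your own organization name and API key filled in.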
Common alerts we see for catalog.data.gov.
No longer occurring, FCS related
Usually manifesting as a New Relic Host Unavailable alarm, the apache2 services (CKAN) consume more and more memory in a short amount of time until they eventually lock up and become unresponsive. This condition seems to affect multiple hosts at the same time.
- From the jumpbox, reload apache2 using Ansible across the web hosts.
$ ansible -m service -a 'name=apache2 state=reloaded' -f 1 catalog-web-v1
- For any individual failed hosts, use retry_ssh.sh (source) to repeatedly retry the apache2 restart on the host. Run this in a tmux session to prevent disconnects.
$ ./retry_ssh.sh $host sudo service apache2 restart
- Because the OOM killer might have killed some services in order to recover, reboot hosts as necessary.
$ ansible-playbook actions/reboot.yml --limit catalog-web-v1 -e '{"force_reboot": true}'
No longer occurring, FCS related
The Netscaler configuration verifies that sites are working and directs traffic only to working machines. To check that a server is responding appropriately, Netscaler sends a HEAD https://{host_name}/api/action/package_search?rows=1 request and expects a 200 response code. Latest health check configuration.
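The check can be reproduced by hand when investigating a host. The following is a minimal sketch using only the Python standard library; the function names and the 10-second timeout are assumptions, not part of the Netscaler configuration.

```python
import urllib.request


def build_health_check_request(host_name):
    """Build the HEAD request described above for a given host."""
    url = f"https://{host_name}/api/action/package_search?rows=1"
    return urllib.request.Request(url, method="HEAD")


def is_healthy(host_name, timeout=10):
    """Return True when the endpoint answers with HTTP 200."""
    request = build_health_check_request(host_name)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError both derive from OSError; treat any failure as unhealthy.
        return False
```

Calling `is_healthy("catalog.data.gov")` performs a real network request, so it is best run from a machine that is allowed to reach the host.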
A cache is created via CloudFront for the entire website to ensure traffic is served quickly and efficiently. Requests to /api/* are not cached.
This has been handled by a number of automation steps, notably here
We have a number of initializations that are run manually to set up a database. In most cases our system reflects the best practices in the CKAN docs.
You can find the credentials in the data.gov repo under the appropriate environment. The user for CKAN may need to be created rather than using the master user.
No longer relevant, PYCSW deprecated
You'll need to create the necessary DBs and users for pycsw in the catalog postgres DB. These should be defined in pycsw-all.cfg.
Then, you'll need to initialize the DB by activating the venv and then running the setup command from the current pycsw directory:
$ . .venv/bin/activate
$ ./bin/pycsw-ckan.py -c setup_db -f /etc/pycsw/pycsw-collection.cfg
These manual steps should be replaced by the script proposed in ticket #4138.
Catalog Solr instances on ECS get restarted on a regular basis for various reasons, such as memory exhaustion. There is a slight chance that the Solr core gets locked after a restart. Here are the manual steps to restart it again in order to recover it.
1. Visit the Solr URLs to confirm that the core is locked for one of the Solr instances. You can get the URLs and credentials with the following commands.
   cf t -s prod
   cf env catalog-web | grep solr -C 2 | grep "uri\|solr_follower_individual_urls\|password\|username"
   A URL with port 9000 means follower 0, 9001 means follower 1, and 9002 means follower 2. You need to know which follower has the core lock for step 6.
2. Log into the AWS SSBDev account and assume the ssb-production role.
3. Find the Elastic Container Service and go to the ECS console.
4. Find the cluster whose name matches the solr-######## found in step 1.
5. You start under the Services tab. Now go to the Tasks tab.
6. Find the entry whose Task Definition shows the locked follower number. Select it by clicking the checkbox.
7. Click the Stop button, then click Stop again in the confirmation window.
8. The instance is now on its way to recovery. It usually takes 5-10 minutes for ECS to restart it multiple times before it finally gets back to normal. Verify it by visiting the Solr URL found in step 1.
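The port-to-follower mapping from step 1 is easy to get wrong under pressure, so it can help to encode it. A small sketch, assuming the follower URLs carry an explicit port as described; the host name below is a placeholder and follower_index is a hypothetical helper.

```python
from urllib.parse import urlsplit

# Port-to-follower mapping taken from step 1 above.
FOLLOWER_BY_PORT = {9000: 0, 9001: 1, 9002: 2}


def follower_index(solr_url):
    """Return the follower number encoded in a Solr follower URL's port."""
    port = urlsplit(solr_url).port
    if port not in FOLLOWER_BY_PORT:
        raise ValueError(f"unexpected Solr port: {port}")
    return FOLLOWER_BY_PORT[port]


# Placeholder host; real URLs come from the cf env output.
print(follower_index("https://solr.example.invalid:9001/solr"))  # 1
```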