Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OSM-POI: Include brand property [DRAFT] #69

Draft
wants to merge 43 commits into
base: master
Choose a base branch
from

Conversation

IritaSee
Copy link
Contributor

@IritaSee IritaSee commented Jan 3, 2022

In this PR, I wanted to solve issue #25 by creating a CSV file to list and also help the brand property matching process. This PR including:

  • Compile a list of all brand names and operator names from the name-suggestion-index as a CSV with the columns id, display_name, and wiki_data in tmp folder.
    Deadline: 05.01.2022

  • Create a PySpark UDF similar to the ones in the osm-poi/src/Processor.py that takes a string and returns the best match against the name-index-suggestions using fuzzy-wuzzy
    Deadline: 06.01.2022

  • Apply that function for the brand and operator columns which creates two new columns brand_matched and operator_matched.
    Deadline: 07.01.2022

@IritaSee IritaSee requested a review from mattigrthr January 3, 2022 09:11
@IritaSee IritaSee added enhancement New feature or request good first issue Good for newcomers pipeline/osm-poi Issues related to the osm-poi pipeline labels Jan 3, 2022
@IritaSee IritaSee linked an issue Jan 3, 2022 that may be closed by this pull request
Copy link
Contributor

@mattigrthr mattigrthr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result is looking good! 👍🏽 Just added a few naming and coding convention comments.

kuwala/pipelines/osm-poi/src/Downloader.py Outdated Show resolved Hide resolved
kuwala/pipelines/osm-poi/src/Downloader.py Show resolved Hide resolved
kuwala/pipelines/osm-poi/src/Downloader.py Outdated Show resolved Hide resolved
kuwala/pipelines/osm-poi/src/Downloader.py Outdated Show resolved Hide resolved
kuwala/pipelines/osm-poi/src/Downloader.py Outdated Show resolved Hide resolved
@mattigrthr
Copy link
Contributor

@IritaSee as discussed, here's some more background and assistance for task 3:

To add a new column to the PySpark data frame, use withColumn().
For example df_pois = df_pois.withColumn('brand_matched', match_brand_name(col('brand'))) would add a new column to df_pois called 'brand_matched' which would contain the value returned from the PySpark-UDF match_brand_name.

The UDF itself is applied for each row. So when you pass the col('brand') to the UDF it is working on all values individually (e.g., if there are 1,000 rows in that data frame, the UDF is executed 1,000 times). It effectively works on one string, e.g., 'McDonalds', and simply returns the best match from the brand name list.

Since the UDF is executed many times, we don't want to and can't read the CSV with the brand names a thousand times within the UDF. That's what 'broadcasts' in PySpark are for. They wrap a serializable value and make them available from inside the UDF. This is an example of how to use a broadcast from inside a UDF:

image

To make the main function more readable, you can create a static function that reads the CSV with the brand names, creates a broadcast from it, adds the two columns brand_matched and operator_matched by applying the UDF and returns the result.

The static function would be applied after the way, node, and relation data frames have been joined. It could look something like this: df_pois = Processor.match_brand_and_operator_names(df_pois)

Copy link
Contributor

@mattigrthr mattigrthr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From looking and the code it seems basically ready. What is still missing @IritaSee?

kuwala/pipelines/osm-poi/src/Processor.py Outdated Show resolved Hide resolved
kuwala/scripts/initialize_windows.sh Outdated Show resolved Hide resolved
@IritaSee
Copy link
Contributor Author

From looking and the code it seems basically ready. What is still missing @IritaSee?

I feel not ready yet with the PR because of testing issue... I tested them all but I feel the branch did not run smoothly... I think we need to discuss about this but not for so long :") @mattigrthr

@IritaSee IritaSee marked this pull request as ready for review January 27, 2022 14:27
@IritaSee IritaSee requested a review from mattigrthr January 27, 2022 14:27
Matti Grotheer added 2 commits February 2, 2022 17:34
# Conflicts:
#	kuwala/common/python_utils/src/FileSelector.py
#	kuwala/common/python_utils/src/spark_udfs.py
#	kuwala/core/database/importer/src/population_density_importer.py
#	kuwala/pipelines/osm-poi/src/Downloader.py
#	kuwala/pipelines/osm-poi/src/Processor.py
#	kuwala/pipelines/osm-poi/src/main.py
@mattigrthr
Copy link
Contributor

mattigrthr commented Feb 3, 2022

This PR is currently on hold. See: #25 (comment)

After doing some initial tests with the brand and operator name matching, it turns out that including the matching in the OSM-POI pipeline directly would increase the runtime significantly. Therefore, we have decided to store the consolidated list of brand and operator names in a separate table in Postgres, which can then be used later in transformation blocks (e.g., on a filtered set of POIs and thus drastically reduce the runtime).

Since the canvas development currently has a higher priority for the core team, this issue is up for grabs again.

@mattigrthr mattigrthr marked this pull request as draft February 3, 2022 13:47
@mattigrthr mattigrthr requested review from mattigrthr and removed request for mattigrthr February 3, 2022 13:48
@IritaSee IritaSee changed the title OSM-POI: including brand property OSM-POI: Include brand property [DRAFT] Mar 2, 2022
@arifluthfi16 arifluthfi16 deleted the feature/include-brand-property branch May 3, 2022 08:59
@arifluthfi16 arifluthfi16 restored the feature/include-brand-property branch May 3, 2022 11:32
@arifluthfi16 arifluthfi16 reopened this May 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers pipeline/osm-poi Issues related to the osm-poi pipeline
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OSM-POI: Include brand property
3 participants