-
Notifications
You must be signed in to change notification settings - Fork 273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data Liberation] Topological sorter, entities remapping and add missing imports #2030
base: trunk
Are you sure you want to change the base?
Conversation
c51d9c4
to
7778714
Compare
packages/playground/data-liberation/src/import/WP_Stream_Importer.php
Outdated
Show resolved
Hide resolved
packages/playground/data-liberation/src/import/WP_Stream_Importer.php
Outdated
Show resolved
Hide resolved
packages/playground/data-liberation/src/import/WP_Topological_Sorter.php
Outdated
Show resolved
Hide resolved
85c850b
to
78f1cb3
Compare
78f1cb3
to
5af5722
Compare
* Quicksort performs badly on already sorted arrays, O(n^2) is the worst case. | ||
* Let's consider using a different sorting algorithm. | ||
*/ | ||
uksort( $elements, $sort_callback ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This requires fitting all $elements
into memory. This may be fine for v1, since every $element is relatively small, but we'll hit the limits of this approach sooner than later. Is it possible to perform topological sorting with at a reasonable speed without holding everything in memory? If not, how much RAM would this need to process one of these huge VIP 1TB exports?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can save the data in a custom DB table (Jetpack have a custom wp_jetpack_sync_queue
table) instead of a simple array in memory. I didn't disturb the database in this first phase, but we can. Reproducing this in a table and using the custom sort here is straightforward. At the end of collecting the byte offsets, it's a matter of ALTER TABLE wxr_sorting ORDER BY custom_sorting_like_the_one_above
, and we are done, even with our new reentrancy model.
When it finds a row in this table, the streamer can jump to the "correct" position (the one it should load before the current one) and proceed to the next one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lovely! If it's straightforward, let's include it in v1 – it seems like we can ship something that's on the right track with relatively little effort. It would also enforce shaping the API to stream the sorted list instead of loading it into memory, which would have implications for the reentrancy cursor. It may seem excessive for v1, but thinking about it more it seems like something that could have a ripple through the entire system if we don't account for streaming early on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may seem excessive for v1, but thinking about it more it seems like something that could have a ripple through the entire system if we don't account for streaming early on.
Agree. It's reasonable, let's proceed this way. 👍
495be8b
to
e95618e
Compare
@@ -93,7 +93,7 @@ public function __construct( $options = array() ) { | |||
$this->mapping['term_id'] = array(); | |||
$this->requires_remapping = $empty_types; | |||
$this->exists = $empty_types; | |||
$this->logger = new Logger(); | |||
$this->logger = isset( $options['logger'] ) ? $options['logger'] : new WP_Logger(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually not sure we need a logger beyond _doing_it_wrong()
. All the progress information are now exposed as numbers and the API consumer may implements its own logging using any technique.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense, thanks.
packages/playground/data-liberation/src/import/WP_Stream_Importer.php
Outdated
Show resolved
Hide resolved
if ( true === $this->topological_sort_next_entity( $count ) ) { | ||
return true; | ||
} | ||
|
||
// We indexed all the entities. Now sort them topologically. | ||
$this->topological_sorter->sort_topologically(); | ||
$this->topological_sorter = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It reads confusing to me:
- Sort every entity we encounter and short circuit while we have entities
- Once we run out of entities to sort, sort them
What would be another way to call these operations?
packages/playground/data-liberation/src/import/WP_Stream_Importer.php
Outdated
Show resolved
Hide resolved
packages/playground/data-liberation/src/import/WP_Topological_Sorter.php
Outdated
Show resolved
Hide resolved
/** | ||
* The base name of the table. | ||
*/ | ||
const TABLE_NAME = 'data_liberation_index'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is called index, but it's not used in the index_next_entities()
stage, only in the topological sort stage. How about data_liberation_topological_index
or something to that effect?
* Run in the 'plugins_loaded' action. | ||
*/ | ||
public static function load() { | ||
if ( self::DB_VERSION !== (int) get_site_option( self::OPTION_NAME ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused about the distinction between load()
and activate()
– could we either have one or document them a bit more to explain the usage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The activation/deactivation is when a user clicks "activate" and "deactivate" link. The load action is usually used if you need to change schema:
The best practices are here:
- Installation, activation steps: https://developer.wordpress.org/plugins/plugin-basics/activation-deactivation-hooks/
- Custom tables: https://learn.wordpress.org/lesson/custom-database-tables/
- Schema change: https://developer.wordpress.org/plugins/creating-tables-with-plugins/
Names can be misleading. Let me add more docs.
$table_name = self::get_table_name(); | ||
|
||
// Create the table if it doesn't exist. | ||
// @TODO: remove this custom SQLite declaration after first phase of unit tests is done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the CREATE TABLE statement doesn't work with the legacy SQLite integration we're using now? cc @JanJakes this could be useful for testing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JanJakes it is, thanks! Removed the special SQL.
/** | ||
* Run by register_deactivation_hook. | ||
*/ | ||
public static function deactivate() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there be any advantage to running this when the plugin is uninstalled and not just deactivated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've read this here https://learn.wordpress.org/lesson/custom-database-tables/:
If the users of your plugin do not need the data in this table if they deactivate the plugin, you could trigger this on the plugin deactivation hook.
register_deactivation_hook( __FILE__, 'wp_learn_delete_table' );
However, if the data in that table is important, and your users might want to keep it, even if the plugin is deactivated, you could delete the table using one of the two uninstall methods available to plugins.
Our data is not "important" in the strict sense of the term. It can be recreated again; it's just bulky. So, removing it when the user deactivates the plugin is okay. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good 👍
* @param int $byte_offset The byte offset of the category. | ||
* @param array $data The category data. | ||
*/ | ||
public function map_category( $byte_offset, $data ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
map makes me think of map/reduce and a mapper callback. How about insert_category
or add_category
or index_category
? Ditto for the other map_ method
'element_id' => (string) $data['term_id'], | ||
'parent_id' => $category_parent, | ||
'byte_offset' => $byte_offset, | ||
// Items with a parent has at least a sort order of 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's document this more. What's the logic behind the sort_order of 1, 2 etc? Should it be more for deeply nested categories or not?
* | ||
* @var int | ||
*/ | ||
protected $orphan_post_counter = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to restore this (or any other) value after stopping and resuming the import?
* | ||
* @return int|bool The byte offset of the category, or false if the category is not found. | ||
*/ | ||
public function get_category_byte_offset( $session_id, $slug ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many methods can we make private
? Every public method makes a public API and gets covered by the WordPress indefinite BC policy upon merge to WordPress core.
|
||
return $wpdb->get_var( | ||
$wpdb->prepare( | ||
'SELECT byte_offset FROM %i WHERE element_id = %s AND element_type = %d AND session_id = %d LIMIT 1', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to fetch these in chunks of 100 or 1000 to avoid hitting the database every time? Also, noting the docstring says "and remove it from the list" while the method doesn't perform the removal.
|
||
// MySQL version - update sort_order using a subquery | ||
return $wpdb->query( | ||
$wpdb->prepare( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How well does this perform on large datasets? Running a single, table-wide UPDATE that would potentially affect a few hundred million rows is making me anxious, although that's just a gut feeling.
@adamziel I cherry-picked those parts not strictly related to the topo-sort in these two PRs:
I would like to see some fixes and missing parts merged now that our discussion is going on. 👍 |
@@ -108,6 +113,9 @@ public function __construct( $options = array() ) { | |||
'default_author' => null, | |||
) | |||
); | |||
|
|||
WP_Topological_Sorter::activate(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implicitly runs a bunch of database operations in the object constructor. If we create 10 objects, it will run 10 times. If it fails, we don't have a good way of reporting that. What do you think about running it in a plugin-related hook instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has already run for the first time during the activations. I can remove it from here.
if ( array_key_exists( 'session_id', $options ) ) { | ||
$this->set_session( $options['session_id'] ); | ||
} else { | ||
$active_session = WP_Import_Session::get_active(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we're not using an WP_Import_Session
object for this import run? Note nothing in WP_Entity_Importer
depends on direct access to WP_Import_Session
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
@@ -108,6 +113,9 @@ public function __construct( $options = array() ) { | |||
'default_author' => null, | |||
) | |||
); | |||
|
|||
WP_Topological_Sorter::activate(); | |||
$this->topological_sorter = new WP_Topological_Sorter( $this->options ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this object only exist for the duration of the stages where it's used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used even later during the import stage.
@@ -257,6 +267,7 @@ public function import_user( $data ) { | |||
* @param array $userdata Raw data imported for the user. | |||
*/ | |||
do_action( 'wxr_importer_processed_user', $user_id, $userdata ); | |||
// $this->topological_sorter->map_entity( 'user', $userdata, $user_id ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be uncommented? Or is it a placeholder for later? Then let's document inline how will we know to uncomment it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is commented out because I'm not sure what we do with wp_insert_user
. If we will reassign, rewrite or possibly remap the user of the wxr with the preexisting one, like in other importers.
@@ -267,13 +278,13 @@ public function import_term( $data ) { | |||
* @param array $meta Meta data. | |||
*/ | |||
$data = apply_filters( 'wxr_importer_pre_process_term', $data ); | |||
$data = $this->topological_sorter->get_mapped_entity( 'term', $data ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read this line without any additional context, and I'm confused about what's happening here – we use a sorter
, but instead of sorting, we map.
Let's come up with a set of terms that meaningfully reflect that entire idea space. Ideally, this line would communicate something to a new person without requiring additional work, such as diving into the source code of that method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, could we only source specific values from the sorter instead of blanket-replacing the entire $data
array? This leads to a section in the code below where we just accept the parent
value from $data
instead of explicitly using the mapped parent value. It's an easy way for unprocessed user input to slip through if we miss even a single check in a single code path. We can reduce the likelihood of running into that entire class of errors by just saying "sorry, $data['parent']
won't come through, we use $mapped_parent
instead".
if ( isset( $this->mapping['term'][ $parent_id ] ) ) { | ||
// Map the parent term, or mark it as one we need to fix | ||
if ( $parent ) { | ||
// TODO: add parent mapping and remapping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this TODO still relevant? Or can this entire section be cleaned up and those commented code parts removed? Topological sorting should remove the need for remapping, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old comment can be cleaned up, yes.
@@ -314,9 +327,30 @@ public function import_term( $data ) { | |||
// Wipe the parent for now | |||
$data['parent'] = 0; | |||
}*/ | |||
$parent_term = term_exists( $parent, $data['taxonomy'] ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would running a SELECT before every insert be a show-stopper for importing, say, 100M records to an already large database? Or would that be manageable? cc @smithjw1 for VIP thoughts
$data['parent'] = $parent_term['term_id']; | ||
} else { | ||
// It can happens that the parent term is not imported yet in manually created WXR files. | ||
$parent_term = wp_insert_term( $parent, $data['taxonomy'] ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the parent's parent is also missing? In other words, imagine we have a hierarchy of 10 nested parents and we're to process a term at the very bottom of that hierarchy?
Here's a visual representation:
Root Parent
│
├── Parent Level 1
│ │
│ ├── Parent Level 2
│ │ │
│ │ ├── Parent Level 3
│ │ │ │
│ │ │ ├── Parent Level 4
│ │ │ │ │
│ │ │ │ ├── Parent Level 5
│ │ │ │ │ │
│ │ │ │ │ ├── Parent Level 6
│ │ │ │ │ │ │
│ │ │ │ │ │ ├── Parent Level 7
│ │ │ │ │ │ │ │
│ │ │ │ │ │ │ ├── Parent Level 8
│ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ ├── Parent Level 9
│ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ └── Our term
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, why would a parent be missing if we've already sorted everything topologically?
/** | ||
* Map an entity to the index. If $id is provided, it will be used to map the entity. | ||
* | ||
* @param string $entity_type The type of the entity. | ||
* @param array $data The data to map. | ||
* @param int|null $id The ID of the entity. | ||
* @param int|null $additional_id The additional ID of the entity. | ||
*/ | ||
public function map_entity( $entity_type, $data, $id = null, $additional_id = null ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method name and docstring don't explain much about this method. Let's include a thorough description of what this method does. It might take quite a few paragraphs of text and even be longer than the method itself – it's fine, and it's well worth it in the long run.
The role model for documenting mission-critical and non-trivial code is the WP_HTML_Tag_Processor class. All the mental models are explained in depth and with examples. It discusses paths not taken, failure modes, and design decisions.
Here's just a few questions that came up for me as I read this: What's "the index"? Where does it live? What is its layout? What is it used for? How do we retrieve from it? Why an $id
is needed? What is the relevance of $additional_id
? What if one of them is missing? Can this method fail, and how do we know?
@zaerl can we move all the |
|
||
// Map the parent ID if the entity has one. | ||
switch ( $entity_type ) { | ||
// @TODO: missing comment parent ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would it take to support comment parent ID?
break; | ||
case 'post': | ||
if ( 'post' === $data['post_type'] || 'page' === $data['post_type'] ) { | ||
if ( array_key_exists( 'post_parent', $data ) && '0' !== $data['post_parent'] ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're dealing with raw user input here so we may also need to check if it's an int etc
} | ||
|
||
// The entity has been imported, so we can use the ID. | ||
if ( $id ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the rationale for calling a topological sorter that is meant to sort the entity stream before we start importing, after an entity is imported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or is it doubling as an ID mapper? While reusing the database makes sense, I'm not sure if topological sorter class is the best place for the ID mapping logic. It might take us some discussion to find the right place. Let's remove anything related to mapping IDs for now and focus just on pre-import sorting.
I've finished the review and I'm very confused. It seems like it's an ID mapper that doesn't actually sort the entity stream topologically. Am I missing something? |
Already done: |
The initial memory-array sorting and CTE were shelved because they had problems with massive data sets. I've mentioned that in the discussion of design choice 8. The sort order is not used, only the remapping of the IDs and the other changes that have been moved to the other two PRs. I will try something else now that I've read your ideas on the discussion and tell you what I will do before implementing the new code. We can take this on draft. Thanks. |
You can find a discussion here with a more detailed explanation: #2090
Motivation for the change, related issues
Topological Sorting of WXR entities before starting the import to ensure parent posts are imported before child posts. Processing WXR in a topological order may require an index with all offsets and lengths of items in the WXR.
Implementation details
Entities preloading/loading/mapping
The topological sort happens during a
STAGE_TOPOLOGICAL_SORT
phase runt before theSTAGE_FRONTLOAD_ASSETS
.This PR removes the
WP_Entity_Importer::mapping
andWP_Entity_Importer::exists
arrays and all memory preload. It is slow and memory-consuming, with thousands of entries and not support sessions. It adds a new table with import IDs and mapped IDs. During the first phase, it is prefilled. During theWP_Entity_Importer
import it maps the imported IDs by using thewxr_importer_*
filters and actions.New WP-CLI scriptThis PR also introduces the new CLI script and moves the logger there.moved to #2104New unit tests
Added all the 130 WordPress core unit tests. Updated the WXRs to latest version.Added aPlaygroundTestCase
base class that cleanup the WordPress database after a test, as_delete_all_data
doesMoved to [Data Liberation] Add support for terms meta and new unit tests #2105
Missing imports and hierarchies
Added support for:
Term metaNew PHPUnit filterThis PR adds aPHPUNIT_FILTER
constant topackages/playground/data-liberation/tests/import/blueprint-import.json.
If the value is not falsy it will be passed to PHPUnit when callingnpx nx run playground-data-liberation:test:wp-phpunit
. So, for example, you can set"PHPUNIT_FILTER": "WPRewriteUrlsTests".
It will be the same as runningphpunit --filter WPRewriteUrlsTests
.Moved to #2105
Testing Instructions (or ideally a Blueprint)
Unit tests
Many of the tests can be run only in a real WordPress environment, that's why
wp-phpunit
exists.Data test
.xml
you find inpackages/playground/data-liberation/tests/wxr/*
About to add and pass all the tests found here https://github.com/WordPress/wordpress-importer/tree/master/phpunit/tests.
New scriptor:It accepts a file, a URL, or a folder. Runmoved to #2104wp help data-liberation import
to see all options.