Here are my latest discoveries about concurrent LPM:
At work, we could not run concurrent live partition migrations between our two p795s, whatever we tried (multiple HMC/SDMC, multiple MSPs, cross-site migrations…), so we opened a PMR with IBM.
The IBM support engineer pointed me to the max_virtual_slots value (set in the partition profile), which can cause problems when it is greater than 1000. According to IBM, a fix is on the way.
Indeed, for internal reasons (our adapter ID numbering policy), we had set this value to 2048 or even 4096 on our VIO servers. Big mistake: no concurrent migration could be performed.
First of all, we need to decrease the max_virtual_slots value to, let’s say, 256 in all our VIOS profiles:
chsyscfg -r prof -m managed_system -i "name=VIO1,lpar_name=VIO1-prod,max_virtual_slots=256"
But the partition has to be stopped and reactivated in order to load the profile with the new max_virtual_slots value.
More importantly, we need to change all the adapter IDs greater than 1000 to smaller values. We numbered our FC adapter IDs as the client ID times 100, which grows rapidly on a p795: client ID 44 could give an FC adapter ID of 4400, and to allow that you need to increase the max_virtual_slots value. Renumbering is a pain, because you have to delete and recreate the virtual FC adapter on the VIO (running system and profile), run cfgmgr, remap the vfchost to the physical fcs, and modify the profile on the VIO AND on the virtual server to keep everything consistent. And pray that your multipathing is fully working. Good luck.
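To get a quick inventory of the adapters that need renumbering, something like the following sketch could help. It assumes you have saved or piped in a two-field "lpar_name,slot_num" listing; the function name high_slot_adapters, the sample data, and the exact HMC query mentioned in the comment are illustrative assumptions, not something taken from the PMR.

```shell
#!/bin/sh
# Print every "lpar_name,slot_num" record whose slot number exceeds a limit.
# The input is assumed to be two comma-separated fields per line, e.g. the
# output of an HMC query along the lines of (assumption, adapt to your setup):
#   lshwres -r virtualio --rsubtype fc --level lpar -m mypseries -F lpar_name,slot_num
high_slot_adapters() {
    limit="${1:-1000}"
    awk -F, -v limit="$limit" '$2 + 0 > limit + 0 { print $1 "," $2 }'
}

# Made-up sample data standing in for real HMC output.
sample='VIO1,12
VIO1,2101
VIO2,2102
client44,5'

# Lists only the records with a slot number above 1000.
printf '%s\n' "$sample" | high_slot_adapters 1000
```

Run from a workstation with the HMC output piped in over ssh, this gives the list of adapters that would have to move below the new max_virtual_slots value.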
So we came up with a different method, using fake virtual adapters and LPM, to get smaller device IDs for our virtual Fibre Channel adapters and to shut down/restart the VIOS (one by one), without any downtime at all:
- To lower max_virtual_slots, I first have to change the adapter IDs used by the client (e.g. 2101 and 2102 as server adapter IDs), then those on the VIOS side.
- But without stopping the partition and altering the profile, there is only one solution:
- I created “fake” virtual FC adapters on the VIO servers of the other frame (target frame), with the same IDs (2101 and 2102) and a partner adapter ID of 99, just to keep an eye on them.
- I migrate the virtual server (LPM).
- On arrival at the target frame, a cfgmgr runs along with a check: if the adapter ID is not already in use, the adapter keeps the same ID.
- If it is already in use, it takes the first available IDs (let’s say 5 and 6 are available).
- The migration is complete, and my client is now connected to VIO server adapter IDs 5 and 6 instead of 2101 and 2102.
- I can now migrate back with my new “tiny” IDs (even if 5 and 6 are taken on the source side, it won’t go back to 2101; it will again be set to the next available ID, like 7, 10, or 23, whatever).
- Now (and only now that the virtual server has left the target frame) I can change the profile on the target VIO servers and set max_virtual_slots to a more “IBM bug-free compliant” value, like 256, instead of the 4096 it used to be.
- I can also delete the fake adapter used to spoof the server adapter ID (after an rmdev on the vfchost that was discovered by the cfgmgr run on the VIOS when the partition migrated).
- The last thing to do is shut down each VIO server (one after another, of course) and restart it, loading the new profile I just modified. With fully working redundancy between my VIO servers, it should not be a problem.
–> I also changed max_virtual_slots for all the virtual servers: it was set to 255 and I changed it to 32 (default value, shouldn’t be higher anyway).
- Indeed, it changes everything: we can now run 8 concurrent migrations on each p795 (4 per MSP, which is the current limit; I’m confident it will grow one day), as was expected in the first place.
- It also speeds up the migrations (each one used to take 10-15 minutes; now it is closer to 2-6 minutes).
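The ID reassignment the whole trick relies on (keep the ID if it is free on the target, otherwise take the first available one) can be modelled with a few lines of shell. This is only a toy simulation of the behaviour as I observed it, not the actual hypervisor logic; the function name assign_slot and the choice to start searching at slot 2 are assumptions made for the example.

```shell
#!/bin/sh
# assign_slot WANTED USED...
# Print WANTED if it is not in the USED list; otherwise print the lowest
# unused slot number. Toy model only, not the real HMC/hypervisor algorithm.
assign_slot() {
    wanted="$1"; shift
    used=" $* "
    case "$used" in
        *" $wanted "*) ;;                    # wanted ID is taken: search below
        *) echo "$wanted"; return 0 ;;       # wanted ID is free: keep it
    esac
    candidate=2    # assumption: slots 0 and 1 are reserved, search starts at 2
    while :; do
        case "$used" in
            *" $candidate "*) candidate=$((candidate + 1)) ;;
            *) echo "$candidate"; return 0 ;;
        esac
    done
}

# Target VIOS: slots 2-4 in use, plus the fake adapters on 2101 and 2102.
used="2 3 4 2101 2102"
first=$(assign_slot 2101 $used)            # 2101 is taken by a fake adapter
second=$(assign_slot 2102 $used "$first")  # 2102 is taken as well
echo "$first $second"                      # prints "5 6"
```

Without the fake adapters in the used list, assign_slot would simply hand back 2101 and 2102, which is exactly why the fake adapters are needed on the target side.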
I also discovered that even if the source/target VIO servers (not MSPs) have adapter IDs >1024, concurrent migrations are possible as long as the MSPs have good values (256). So the concurrent migration problem is caused only by the MSPs’ max_virtual_slots value (I had thought every VIOS needed a lower value, not only the MSPs).
Besides, I also discovered that high IDs on the VIOS side (not MSP) affect the duration of the migration, even with concurrent migrations.
- So with our “high IDs” policy, we had two problems in one: same cause, multiple consequences!
I hope it will help somebody, someday!
- If you need to know if your frame is LPM capable:
# lssyscfg -r sys -m mypseries -Fname,active_lpar_mobility_capable,inactive_lpar_mobility_capable
mypseries,1,1
–> Here, we can achieve LPM migrations (active AND inactive, which means even if the lpar is shut down)
- If you want more information about the LPM capabilities of your pseries:
# lslparmigr -r sys -m mypseries
inactive_lpar_mobility_capable=1,num_inactive_migrations_supported=4,\
num_inactive_migrations_in_progress=0,active_lpar_mobility_capable=1,\
num_active_migrations_supported=8,num_active_migrations_in_progress=0,\
inactive_prof_policy=config
- If you want to know which of your VIOS are MSPs (sorry for the ugliest grep I’ve ever done):
# lssyscfg -r lpar -m mypseries -Fname,msp |grep ",1"
VIO1,1
VIO2,1
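If the grep bothers you too, a slightly stricter filter is an exact match on the second field with awk. This is a small sketch that assumes the "name,msp" output is piped in from the HMC (for example over ssh from a workstation); the function name list_msps and the sample data are made up for the example.

```shell
#!/bin/sh
# Read "name,msp" lines on stdin and print the names whose msp flag is
# exactly 1, instead of grepping for the substring ",1".
list_msps() {
    awk -F, '$2 == "1" { print $1 }'
}

# Made-up sample standing in for: lssyscfg -r lpar -m mypseries -Fname,msp
printf 'VIO1,1\nVIO2,1\nclient44,0\n' | list_msps
```

It prints only the partition names (VIO1 and VIO2 with the sample above), which is handy if the list feeds another command.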
- Checking the migration state of an LPAR on the HMC:
# lslparmigr -r lpar -m mypseries -Fname,migration_state --filter lpar_names="my_mygrating_lpar"
my_mygrating_lpar,Not Migrating
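To watch several migrations at once, the same "name,migration_state" output (without the filter) can be reduced to the partitions that are currently doing something. A minimal sketch, assuming every idle partition reports exactly "Not Migrating" as in the output above; the function name active_migrations and the sample states are illustrative only.

```shell
#!/bin/sh
# Read "name,migration_state" lines on stdin and print the names of the
# partitions whose state is anything other than "Not Migrating".
active_migrations() {
    awk -F, 'NF >= 2 && $2 != "Not Migrating" { print $1 }'
}

# Made-up sample standing in for:
#   lslparmigr -r lpar -m mypseries -Fname,migration_state
printf 'lpar_a,Not Migrating\nlpar_b,Migration In Progress\n' | active_migrations
```

Counting the lines of that output (with wc -l, for instance) gives a rough idea of how many migrations are running concurrently on the frame.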