OpenScape 4000 V10 standby ecoserver can't boot

Status
Not open for further replies.

iiaacc

Systems Engineer
Sep 13, 2023
6
US
The standby ecoserver of an OpenScape 4000 V10 system can't boot.
I tried a "replace node" -> 微信图片_20240918175651.png
The ecoserver screen always stays at 0%, and the page returns to the current view after 15 minutes. I tried many times and the result is always the same.
As far as I know, replace node is a network boot initiated by the active ecoserver over TFTP, and it runs through the corosync LAN (ETH4). I have confirmed that ETH4 and ETH6 are looped between the two ecoservers.
Is there any way to confirm whether this is a hard disk failure or a problem with the standby ecoserver mainboard? I am not near these two devices. Are there any suggestions for narrowing it down? Can a monitor and keyboard be used to log in to the standby ecoserver and check the BIOS or the SSD (hard disk)? Or can the hard disk of the active ecoserver be used to boot the standby ecoserver, to determine whether the standby hardware is working or whether a new set of primary and standby hard disks needs to be created?


ecoserver4-6自环.jpg
 
That minimal log looks like it's working; the active node appears to have done what it needs to do.

Ideally you want a monitor on the dead node so you can see what it's doing. Difficult if you are not there, I realise.

You can also watch what happens over the USB connection, if someone can connect a laptop to the 4K with a USB cable (A-B, printer cable). You can use PuTTY, but you would need the USB driver, which is on the install image for the 4K.

It usually takes about 30 minutes to recover the node. It has to make a fresh install of Linux and then sync the DBs from the active processor.

You asked if you could use the active node hard disk to boot the standby. Well yes, you could, but should you? You would have to either power down node A or completely disconnect node B from everything. Node B would boot with the IP addresses from node A, and the RTM is connected to the B connector on the LTUCs. It's messy. If node A is RAIDed, you could stop the RAID and then use that HD in B to receive the reinstall. You would have to be very careful that node B did not boot as node A. If it didn't work, put the HD back in node A and restart the RAID.

Ideally, the first step here is to get a monitor or laptop on node B to watch what happens during node recovery and find where the process is failing.
 
Silly question, but is there a USB stick in the B node? If so, it might not boot up; isn't there a setting in the BIOS for that?
 
@Moriendi, thank you for such a detailed suggestion.
Later, I asked the on-site engineer to move the SSD of node A to node B. With it, the ecoserver node B can start up. This shows that there is indeed a problem with the SSD of node B. We are therefore going to collect first-installation*.xml again and build a new set of duplex SSDs to replace both drives of this system, and use a SMART tool to examine the SSD of node B. Are there any suggestions for doing this?
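For the SMART check, here is a rough sketch of what I have in mind, assuming the suspect SSD is attached to an ordinary Linux PC with smartmontools 7.0 or later (which provides `smartctl --json`). The function names and the list of watched attribute IDs are my own choices, not anything from Unify, and which attributes a Transcend or Innodisk drive actually reports may differ.

```python
import json
import subprocess

# ATA attribute IDs often watched for failing drives (an assumption;
# vendors do not all report the same attributes).
WATCHED_IDS = {5, 187, 197, 199}


def summarize_smart(report: dict) -> dict:
    """Reduce a parsed `smartctl --json` report to a few health indicators."""
    # Overall verdict from the drive itself; None if the field is absent.
    passed = report.get("smart_status", {}).get("passed")
    table = report.get("ata_smart_attributes", {}).get("table", [])
    nonzero = {}
    for attr in table:
        raw = attr.get("raw", {}).get("value", 0)
        # Keep only watched attributes whose raw counter is non-zero.
        if attr.get("id") in WATCHED_IDS and raw:
            nonzero[attr.get("name", str(attr["id"]))] = raw
    return {"passed": passed, "nonzero_watched": nonzero}


def read_smart(device: str) -> dict:
    """Run smartctl on a device and return the parsed JSON report."""
    # smartctl uses its exit status as a bitmask, so a non-zero status
    # is not treated as a failure of the command here.
    result = subprocess.run(
        ["smartctl", "--json", "-H", "-A", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return json.loads(result.stdout)
```

Something like `summarize_smart(read_smart("/dev/sdb"))` would then give a short summary; `/dev/sdb` is only a placeholder for wherever the SSD shows up on the test PC.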
 
Regarding the USB stick question: I am not on site. The end user does not have a laptop, such cables, or a monitor; they can only take pictures.
 
One more question: in recent years UNIFY has used two types of SSD to store data, Transcend and Innodisk, and both use 3D TLC NAND. In many projects we find that the problems occur on the node B SSD. Do you know whether this is an inherent problem of TLC NAND?
 

Attachments

  • innode.jpg
  • transcend.jpg
If you have proven the node B SSD is bad, all you should need to do is start node recovery on node A and let node B reinstall once it has a good SSD inserted. You only need to remake the node B disc manually (with the install XML and a USB drive) if there is a problem with the node recovery process. If you have the spare RAID disc from A inserted in B, you should be able to start that process.

The important thing is that you get your replacement SSD from Unify. They will supply the recommended model, which has been tested and released for OS4K. Other SSDs might look the same and work initially, but you need the correct drive and firmware versions that have been tested and released, or you can get strange problems.

I'm not aware of any issue with the drives supplied; in fact, I would say failed SSDs are rare. More likely, any problem is corruption of the OS, which is a Linux issue that can be solved with node recovery. But sometimes these parts fail, as seems to be the case here, and must be replaced. I'm not sure what benefit the SMART data would bring in this case.
 