Document

Abstract

Segmenting complex and hybrid layers, including skin, subcutaneous fat, skeletal muscle, and bone in arm musculoskeletal ultrasound (MSKUS), is extremely crucial for diagnosing and monitoring Breast-Cancer-Related Lymphedema (BCRL). Nevertheless, previous research has primarily focused on individual muscle or bone segmentation in MSKUS, overlooking the horizontal and curvilinear hybrid morphologies that characterize these structures. To address this limitation, we propose a novel hybrid structure-oriented Transformer with structural prior (HSformer+) to promote structural understanding, a framework capable of simultaneously segmenting multilayer hybrid structures. Specifically, HSformer+ first combines hierarchical-consistency relative position encoding to impose structural morphology bias within local windows. Then we propose a structure-biased constraint for attention calculation, innovatively enhancing feature representation. Ultimately, a structure-oriented objective function is introduced to constrain the spatial location and tissue distribution. Our experiments on a public arm MSKUS dataset demonstrate that HSformer+ achieves state-of-the-art segmentation performance. Further experiments on a private arm MSKUS dataset prove that our method has a strong generalization to clinical small-scale MSKUS datasets.

Examples of arm MSKUS with a hybrid and hierarchical structure, including horizontal layers paralleling to the skin surface and curvilinear layers.

Overall Framework

In light of the aforementioned issues and with reference to above figure, we propose a novel hybrid structure-oriented Transformer with structural prior (HSformer+) for arm MSKUS segmentation. This model simultaneously captures effective feature representations for both the correlations and differences of interlayer and intralayer structures from horizontal and vertical perspectives. To address the challenge of curvilinear structures due to the inherent characteristics of the arm MSKUS and the effect of BCRL, hierarchical-consistency relative position encoding (HCPE) imposes structure bias onto the elements within the local window. Meanwhile, to enhance layer differentiation and emphasize critical features, a structure-biased constraint (SBC) is developed to calculate attention weights. Additionally, we design a novel structure-oriented loss based on spatial location and tissue distribution in arm MSKUS images.

Illustration of the proposed HSformer+ framework.

Comparison Experiments on the Public Arm MSKUS Datasets

Considering the significant characteristics of arm MSKUS segmentation, we design three types of comparative experiments to verify the effectiveness of our model from different perspectives:

We compare HSformer+ with nine representative image segmentation models: in addition to UNet, Attention UNet, UNet++, DeepLabV3, TransUNet, there also AAU-Net, BUSSeg EMCADNet and MADGNet.
We select four models recently developed for MSKUS images segmentation, such as GSCNN, AMCNet, EnsemSeg and SpineSeg.
Furthermore, comparisons are made between HSformer+ and four existing algorithms designed for segmentation of layer-structure medical images, involving TCCT, CoherentSeg, RetiFluidNet and Y-Net.

Table I - Comparisons with nine representative image segmentation models. The green area indicates five classic segmentation models, the blue area represents two recent US segmentation architectures, and the orange area represents the latest SOTA methods.

Layer Structure	Method	U_Net	Attention UNet	UNet++	DeeplabV3	TransUnet	AAU-Net	BUSSeg	EMCADNet	MADGNet	HSformer+
Gel	DSC	0.9273	0.9117	0.9415	0.9466	0.9694	0.9515	0.9587	0.9640	0.9332	0.9810
	JI	0.8659	0.8417	0.8943	0.9044	0.9477	0.9304	0.9419	0.9319	0.9120	0.9639
	HD	2.1875	3.1188	2.2773	2.4916	2.3939	1.9393	2.0603	1.9358	2.2794	1.7481
	ASD	0.4734	0.5312	0.4980	0.5429	0.5429	0.4269	0.4563	0.4297	0.5330	0.3852
Skin	DSC	0.7846	0.7875	0.7844	0.7925	0.8492	0.8319	0.8029	0.8261	0.8015	0.8896
	JI	0.6602	0.6581	0.6563	0.6702	0.7600	0.7544	0.6777	0.7168	0.6723	0.8076
	HD	8.8346	4.4768	4.1263	4.1263	4.2186	3.2168	3.2383	2.3533	2.6547	2.8145
	ASD	1.2272	1.1527	1.3511	1.2235	1.0528	1.1357	1.1324	1.1137	1.2286	0.9740
Subcutaneous Fat	DSC	0.8349	0.8517	0.8515	0.8522	0.8572	0.8798	0.8842	0.8956	0.9021	0.9155
	JI	0.7345	0.7584	0.7570	0.7609	0.7679	0.7986	0.8031	0.8208	0.8299	0.8519
	HD	18.2158	14.3111	15.0515	14.2644	13.9462	13.7089	12.8496	12.1448	11.6314	11.3129
	ASD	4.1608	3.1729	3.8462	3.4544	3.4454	3.2277	2.9266	2.8586	2.9162	2.6104
Muscles	DSC	0.8081	0.8300	0.8376	0.8428	0.8687	0.8807	0.8694	0.9009	0.8740	0.9084
	JI	0.7039	0.7319	0.7421	0.7618	0.7860	0.7882	0.7866	0.8071	0.8091	0.8407
	HD	32.1766	31.1278	26.2814	24.9647	25.7427	24.585	25.9433	23.6565	23.8916	22.6557
	ASD	6.8631	5.3731	5.8816	5.6443	5.9926	5.0030	5.2813	5.6278	5.3866	4.5712
Bones	DSC	0.7891	0.7999	0.8234	0.8272	0.8192	0.8497	0.8695	0.8808	0.8701	0.9173
	JI	0.6934	0.7036	0.7380	0.7502	0.7522	0.7824	0.8030	0.8214	0.7970	0.8564
	HD	34.1490	38.6285	27.7593	26.3163	29.6025	26.0241	28.3172	22.1648	23.4392	21.3251
	ASD	6.7805	6.8423	5.6341	5.0575	6.2708	4.7145	4.7353	4.4625	4.4625	3.3540
Mean	DSC	0.8288	0.8361	0.8476	0.8533	0.8711	0.8771	0.8767	0.8934	0.8834	0.9223
	JI	0.7315	0.7387	0.7575	0.7658	0.8021	0.7955	0.7905	0.8214	0.7970	0.8641
	HD	18.1127	13.1245	15.0992	14.3114	14.9328	13.3448	14.3114	12.6454	12.6484	11.9712
	ASD	3.9010	3.4144	3.4422	3.1845	3.4609	2.9015	2.9063	2.9048	2.9512	2.3789

Table II Comparisons with eight MSKUS or layer-structure segmentation models. The blue area displays four representative MSKUS segmentation models, the green area demonstrates four layer-structure segmentation architectures.

class	indicator	SpineSeg	EnsemSeg	GSCNN	AMCNet	RetiFluidNet	Y-Net	TCCT	CoherentSeg	HSformer+
Gel	DSC	0.8970	0.9372	0.9494	0.9170	0.9397	0.9367	0.9368	0.9051	0.9810
	JI	0.8147	0.8840	0.9096	0.8531	0.8897	0.8842	0.8861	0.8319	0.9639
	HD	4.0013	2.5436	2.3418	2.3733	2.1029	2.2984	2.5272	2.4552	1.7481
	ASD	0.8489	0.5352	0.5210	0.5251	0.4545	0.5116	0.5517	0.4949	0.3852
Skin	DSC	0.7279	0.8407	0.7903	0.7987	0.8280	0.8110	0.7976	0.7823	0.8896
	JI	0.5822	0.7417	0.6632	0.6749	0.7242	0.6858	0.6736	0.6549	0.8076
	HD	5.1969	3.5243	3.5666	3.4505	3.3333	3.5590	3.9399	3.9573	2.8145
	ASD	1.9822	1.1842	1.1818	1.1367	1.0845	1.1886	1.2596	1.2250	0.9740
Subcutaneous Fat	DSC	0.8451	0.8739	0.8604	0.8680	0.8648	0.8676	0.8642	0.8356	0.9155
	JI	0.7470	0.7895	0.7711	0.7714	0.7819	0.7742	0.7742	0.7292	0.8519
	HD	24.9700	15.1477	15.6484	14.8270	15.6120	15.5703	15.6439	20.9985	11.3129
	ASD	5.8961	3.5257	3.8932	3.4860	3.4949	3.4456	4.2678	4.1733	2.6104
Muscles	DSC	0.8556	0.8498	0.8430	0.8524	0.8654	0.8621	0.8617	0.8090	0.9084
	JI	0.7558	0.7658	0.7572	0.7630	0.7859	0.7784	0.7765	0.6869	0.8407
	HD	25.5053	27.4446	26.2865	28.0218	25.6670	27.0126	25.8736	48.0643	22.6557
	ASD	5.1969	5.5926	5.3095	6.0242	5.4491	5.4576	4.9813	8.9130	4.5712
Bones	DSC	0.8347	0.8453	0.8149	0.8252	0.8312	0.8459	0.8458	0.7921	0.9173
	JI	0.7396	0.7662	0.7287	0.7339	0.7472	0.7535	0.7557	0.6751	0.8564
	HD	31.7518	26.2052	28.4942	28.2102	29.0166	27.9990	27.1545	47.1655	21.3251
	ASD	6.5026	5.1348	5.5064	5.0220	5.6171	4.5494	5.6406	8.2199	3.3540
Mean	DSC	0.8320	0.8693	0.8516	0.8523	0.8658	0.8646	0.8612	0.8248	0.9223
	JI	0.7278	0.7894	0.7659	0.7619	0.7765	0.7764	0.7557	0.7156	0.8641
	HD	18.2850	14.9820	15.2675	15.1007	15.1463	15.2878	15.2078	24.5281	11.9712
	ASD	4.0798	3.2611	3.4253	3.1081	3.2200	3.3034	3.4015	4.6046	2.3789

Segmentation results predicted by seventeen models to verify the effectiveness of HSformer+ from different perspectives. Yellow, pink, green, blue and red respectively represent the segmentation of gel, skin, fat, muscle and bone layers.

Ablation Analysis

In this section, (1) we first conduct an ablation study on the Transformer block architecture. As shown in Figure (a), HSformer+ divides K heads into two groups for horizontal and vertical hybrid structure-oriented attention calculations, followed by concatenation. We also design an architecture called HSformer-test, where all K heads are used for both horizontal and vertical feature extraction, followed by broadcasting, as depicted in Figure (b). Subsequently, (2) To evaluate the effectiveness of HCPE, SBC and the novel structure-oriented loss, we implement component analysis with several variants.

Architectures of HSformer+ and HSformer-test.

Comparison with SAM (Segment Anything) Model

Furthermore, we test on the recently popular SAM (Segment Anything). We select two representative SAM-based models: SAM-MP: Prompt includes a randomly marked background point as a negative point, and multiple foreground points marked as positive points; SAM-BOX: The bounding box of the GT mask is directly treated as a prompt. The qualitative results are demonstrated in the following figure.

The segmentation performance of SAM-MP, SAM-BOX and the proposed HSformer+.

In-house Small-scale Arm MSKUS Test

To validate the generalization of HSformer+, we directly test 104 arm MSKUS images without model training, which were collected from Beijing Tsinghua Changgung Hospital using a SonoScape E2 machine. Anatomical structures of the MSKUS images from the in-house and public datasets are consistent, with a large amount of speckle noise and shadow artifacts that pose greater challenges to the segmentation task. Although there are significant differences in image quality and style between the two datasets, this experiment proves that our HSformer+ can achieve favorable segmentation performance on untrained clinical small MSKUS datasets, clearly depict the structures of each tissue layer and show strong generalization.