
Add YAZ's line mode MARC as another output format

Open pabloab opened this issue 9 months ago • 7 comments

Right now:

-format string Output format. Accepted values: mrk, mrc, xml, json, or solr. (default "mrk")

Please also add yaz-marcdump's native format, called line. From the man page:

-o format Specifies output format. Must be one of marcxml, marc (ISO2709), marcxchange (ISO25577), line (line mode MARC), turbomarc (Turbo MARC), or json (MARC-in-JSON).

pabloab avatar May 01 '25 00:05 pabloab

Could you provide an example of what the format you are interested in looks like?

I am not sure how different that format is from the default format in marcli (mrk, also known as Mnemonic MARC).

hectorcorrea avatar May 01 '25 13:05 hectorcorrea

$ yaz-marcdump ./pkg/marc/testdata/test_1a.mrc

01805nam a2200385 i 4500
001 ocm57175940
005 20041206161421.0
006 m        d f      
007 cr cn-
008 041206s1976    dcua    sb   f000 0 eng c
040    $a GPO $c GPO $d MvI $d MvI
042    $a pcc
043    $a n-us---
074    $a 0620-A (online)
086 0  $a I 19.4/2:735
100 1  $a Swanson, Vernon E. $q (Vernon Emmanuel), $d 1922-1992.
245 10 $a Guidelines for sample collecting and analytical methods used in the U.S. Geological Survey for determining chemical composition of coal $h [electronic resource] / $c by Vernon E. Swanson and Claude Huffman, Jr.
260    $a [Washington, D.C.] : $b U.S. Dept. of the Interior, U.S. Geological Survey, $c 1976.
336    $a text $2 rdacontent.
337    $a computer $2 rdamedia.
338    $a online resource $2 rdacarrier.
440  0 $a Geological Survey circular ; $v 735.
500    $a Title from title screen (viewed on Dec. 06, 2004)
504    $a Includes bibliographical references.
538    $a Mode of access: Internet from the USGS Web site. Address as of 12/06/04: http://pubs.usgs.gov/circ/c735/index.htm; current access is available via PURL.
650  0 $a Coal $x Analysis.
650  0 $a Coal $x Sampling.
700 1  $a Huffman, Claude.
776 1  $a Swanson, Vernon Emanuel, $d 1922- $t Guidelines for sample collecting and analytical methods used in the U.S. Geological Survey for determining chemical composition of coal $h iv, 11 p. $w (OCoLC)2331861.
856 40 $u http://purl.access.gpo.gov/GPO/LPS56007 $z View online version
907    $a .b37991760 $b 04-08-17 $c 07-26-05
998    $a es001 $b 07-26-05 $c m $d a $e - $f eng $g dcu $h 0 $i 1
910    $a MARCIVE
910    $a Hathi Trust report None
945    $g 0 $j 0 $l esb   $o n $p $0.00 $q   $r   $s - $t 255 $u 0 $v 0 $w 0 $x 0 $y .i138993579 $z 07-26-05

It's a pretty standard suite of MARC tools from Index Data, the creators of Zebra, which is used in most Koha deployments. Unlike MarcEdit, it is FOSS.

Also, if you use bat, you could try this syntax highlighting.

pabloab avatar May 02 '25 19:05 pabloab

Oh, I see. It's pretty close to the format that I use by default (as shown below) but not identical. It does look easy to implement, so I'll take a look at implementing this in the next few weeks.

$ ./marcli -file test_1a.mrc 
=LDR  01805nam a2200385 i 4500
=001  ocm57175940
=005  20041206161421.0
=006  m        d f      
=007  cr cn-
=008  041206s1976    dcua    sb   f000 0 eng c
=040  \\$aGPO$cGPO$dMvI$dMvI
=042  \\$apcc
=043  \\$an-us---
=074  \\$a0620-A (online)
=086  0\$aI 19.4/2:735
=100  1\$aSwanson, Vernon E.$q(Vernon Emmanuel),$d1922-1992.
=245  10$aGuidelines for sample collecting and analytical methods used in the U.S. Geological Survey for determining chemical composition of coal$h[electronic resource] /$cby Vernon E. Swanson and Claude Huffman, Jr.
=260  \\$a[Washington, D.C.] :$bU.S. Dept. of the Interior, U.S. Geological Survey,$c1976.
=336  \\$atext$2rdacontent.
=337  \\$acomputer$2rdamedia.
=338  \\$aonline resource$2rdacarrier.
=440  \0$aGeological Survey circular ;$v735.
=500  \\$aTitle from title screen (viewed on Dec. 06, 2004)
=504  \\$aIncludes bibliographical references.
=538  \\$aMode of access: Internet from the USGS Web site. Address as of 12/06/04: http://pubs.usgs.gov/circ/c735/index.htm; current access is available via PURL.
=650  \0$aCoal$xAnalysis.
=650  \0$aCoal$xSampling.
=700  1\$aHuffman, Claude.
=776  1\$aSwanson, Vernon Emanuel,$d1922-$tGuidelines for sample collecting and analytical methods used in the U.S. Geological Survey for determining chemical composition of coal$hiv, 11 p.$w(OCoLC)2331861.
=856  40$uhttp://purl.access.gpo.gov/GPO/LPS56007$zView online version
=907  \\$a.b37991760$b04-08-17$c07-26-05
=998  \\$aes001$b07-26-05$cm$da$e-$feng$gdcu$h0$i1
=910  \\$aMARCIVE
=910  \\$aHathi Trust report None
=945  \\$g0$j0$lesb  $on$p$0.00$q $r $s-$t255$u0$v0$w0$x0$y.i138993579$z07-26-05
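Comparing the two dumps above, the differences appear to be in how indicators and subfields are spaced. The following is a minimal Python sketch of both layouts, with the rules inferred only from these two samples (not from the YAZ or marcli source). The function names and the `(code, value)` tuple representation are hypothetical, chosen for illustration.

```python
def format_mrk(tag, ind1, ind2, subfields):
    """Mnemonic MARC (mrk) as in the marcli sample: blank indicators
    become backslashes and subfields are written back to back."""
    indicators = (ind1 + ind2).replace(" ", "\\")
    body = "".join(f"${code}{value}" for code, value in subfields)
    return f"={tag}  {indicators}{body}"


def format_line(tag, ind1, ind2, subfields):
    """Line mode as in the yaz-marcdump sample: blank indicators stay as
    spaces and each subfield is written as "$code value", space separated."""
    body = " ".join(f"${code} {value}" for code, value in subfields)
    return f"{tag} {ind1}{ind2} {body}"


# Field 086 from the sample record, already parsed into its parts.
field = ("086", "0", " ", [("a", "I 19.4/2:735")])
print(format_mrk(*field))   # =086  0\$aI 19.4/2:735
print(format_line(*field))  # 086 0  $a I 19.4/2:735
```

Working from parsed subfields rather than from the mrk text sidesteps values that themselves contain a dollar sign, such as the `$0.00` in field 945.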

hectorcorrea avatar May 05 '25 17:05 hectorcorrea

marcli_linux -file "test_10.mrc" -format yaz > "with_marcli_yaz.txt"
yaz-marcdump "test_10.mrc" > "with-yaz-marcdump.txt"
diff "with_marcli_yaz.txt" "with-yaz-marcdump.txt"

There are several differences. At least: there is no newline between records (or at the end), and there is an extra space at the end of each data field.
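To separate differences like these from real content differences, a small check along the following lines could help. It is only a sketch in Python: the function names are hypothetical, and the two sample strings are made-up stand-ins for the two tools' output, not real dumps.

```python
def normalize(text):
    """Drop trailing whitespace on each line and discard blank lines."""
    return [line.rstrip() for line in text.splitlines() if line.strip()]


def whitespace_only_difference(a, b):
    """True when a and b differ, but only in trailing spaces or blank lines."""
    return a != b and normalize(a) == normalize(b)


# Hypothetical two-record outputs: one has trailing spaces and no blank
# line between records, the other has neither trailing spaces nor that gap.
out_a = "001 1\n245 10 $a First \n001 2\n245 10 $a Second \n"
out_b = "001 1\n245 10 $a First\n\n001 2\n245 10 $a Second\n\n"
print(whitespace_only_difference(out_a, out_b))
```

If this returns False for a pair of files that still differ, the remaining differences are in the field content itself.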

pabloab avatar Aug 06 '25 22:08 pabloab

ah! I didn't think about testing with multiple records in the file. Let me take a closer look. Thank you for testing it so quickly and reporting the errors!

hectorcorrea avatar Aug 07 '25 12:08 hectorcorrea

Version 1.3.1 fixes this; the output that I get with marcli is now identical to the one from yaz-marcdump. Let me know if you see other issues. Thank you!

hectorcorrea avatar Aug 10 '25 15:08 hectorcorrea

I tested with more files. Some issues found:

  1. Removed 001 if 100% numeric
  2. Removed spaces at the end of last subfield
  3. Removed subfields present but empty
  4. Problems with some Unicode characters (Greek letters, diacritics). For example, México turns to México
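For the Unicode item, it can help to look at code points rather than at rendered text, since two strings can look the same on screen and still differ. The Python sketch below shows two common ways a string such as "México" can change during conversion. Neither is confirmed as the cause here; both are assumptions for illustration, and the `codepoints` helper is hypothetical.

```python
import unicodedata


def codepoints(text):
    """Return the code points of text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


composed = "M\u00e9xico"  # "e with acute" as a single code point (NFC)
decomposed = unicodedata.normalize("NFD", composed)  # "e" + combining acute

# Looks identical when rendered, but compares unequal and diff reports it.
print(composed == decomposed)
print(codepoints(composed)[:2], codepoints(decomposed)[:3])

# UTF-8 bytes mistakenly decoded as Latin-1 give visibly garbled text.
mojibake = composed.encode("utf-8").decode("latin-1")
print(mojibake)
```

Running `codepoints` on the same field from both outputs would show which of these cases, if either, applies.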

These lines might be handy:

#!/bin/bash
# Compare yaz-marcdump and marcli output for every .mrc file under the current directory.
find . -type f -name "*.mrc" | while IFS= read -r file; do
  echo "Processing $file"

  tmp1=$(mktemp)
  tmp2=$(mktemp)

  # Write straight to the temp files: capturing with $(...) strips trailing
  # newlines and would hide differences at the end of the output.
  yaz-marcdump "$file" > "$tmp1"
  marcli_linux -file "$file" -format yaz > "$tmp2"

  if ! diff -q "$tmp1" "$tmp2" > /dev/null; then
    echo "Differences found in $file (first 5 differences side-by-side):"
    diff --color=always -y --suppress-common-lines "$tmp1" "$tmp2" | head -n 5
  else
    echo "Outputs are identical for $file"
  fi

  echo
  rm "$tmp1" "$tmp2"
done

pabloab avatar Aug 12 '25 23:08 pabloab