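Solr Replication gives a Solr deployment availability: even if one index replica is lost to a disk failure or an accidental deletion, the other replicas can keep serving requests. Replication alone, however, only manages and maintains a single index. Once the index grows to the point where search performance becomes the bottleneck, the only way to scale the query servers is to redesign the index and partition it logically by hand.
SolrCloud was proposed to solve this problem. Coordinated by a ZooKeeper ensemble, SolrCloud splits one index (called a Collection in SolrCloud) into shards that can be placed on different physical nodes. The shards of a Collection do not overlap, so together the physical shards make up one complete Collection. To keep shard data available, SolrCloud has Solr Replication built in, so each shard can be replicated and stored redundantly at the same time. Below, we install and configure a SolrCloud cluster on Solr 4.3.1, the latest release at the time of writing, and use it to distribute index storage and search.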
Preparation
- Server information
Three servers:
10.95.3.61 master
10.95.3.62 slave1
10.95.3.65 slave4
- ZooKeeper cluster configuration
Install ZooKeeper on each of the three nodes above; the version used is zookeeper-3.4.5. First, configure zoo.cfg on the master node as follows:
[hadoop@master ~]$ vi applications/zookeeper/zookeeper-3.4.5/conf/zoo.cfg
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/home/hadoop/applications/zookeeper/zookeeper-3.4.5/data
# the port at which the clients will connect
clientPort=2188
dataLogDir=/home/hadoop/applications/zookeeper/zookeeper-3.4.5/data/logs
server.1=master:4888:5888
server.2=slave1:4888:5888
server.3=slave4:4888:5888
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
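Then create the corresponding data directories. Each node also needs a myid file under dataDir that contains only its own server number (1, 2, or 3, matching the server.N entries above), so adjust that file on each node after copying. The configured installation can be copied directly to the other two nodes:
scp -r applications/zookeeper/zookeeper-3.4.5 hadoop@slave1:~/applications/zookeeper/
scp -r applications/zookeeper/zookeeper-3.4.5 hadoop@slave4:~/applications/zookeeper/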
To start the ZooKeeper cluster, start the ZooKeeper service on each node:
cd applications/zookeeper/zookeeper-3.4.5/
bin/zkServer.sh start
Check the status of the ZooKeeper cluster to make sure it started without problems:
[hadoop@master zookeeper-3.4.5]$ bin/zkServer.sh status
JMX enabled by default
Using config: /home/hadoop/applications/zookeeper/zookeeper-3.4.5/bin/../conf/zoo.cfg
Mode: follower
[hadoop@slave1 zookeeper-3.4.5]$ bin/zkServer.sh status
JMX enabled by default
Using config: /home/hadoop/applications/zookeeper/zookeeper-3.4.5/bin/../conf/zoo.cfg
Mode: follower
[hadoop@slave4 zookeeper-3.4.5]$ bin/zkServer.sh status
JMX enabled by default
Using config: /home/hadoop/applications/zookeeper/zookeeper-3.4.5/bin/../conf/zoo.cfg
Mode: leader
As shown, the slave4 node is the Leader of the ZooKeeper cluster.
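The Solr commands later in this article address the ensemble with a connection string of the form host:clientPort,host:clientPort,... The following is a minimal sketch of how such a string could be derived from the zoo.cfg shown above, assuming the file is read as Java properties. The class and method names (ZkHostBuilder, zkHost) are hypothetical and not part of ZooKeeper or Solr.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper: derives a client connection string from zoo.cfg text.
final class ZkHostBuilder {

    /** Returns "host1:port,host2:port,..." ordered by server id. */
    static String zkHost(String zooCfg) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(zooCfg)); // lines starting with '#' are skipped as comments
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String clientPort = props.getProperty("clientPort");
        // Sort by server id so the result is deterministic.
        TreeMap<Integer, String> hosts = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith("server.")) {
                int id = Integer.parseInt(key.substring("server.".length()));
                // The value looks like host:quorumPort:electionPort; keep the host only.
                hosts.put(id, props.getProperty(key).split(":")[0]);
            }
        }
        List<String> parts = new ArrayList<>();
        for (String host : hosts.values()) {
            parts.add(host + ":" + clientPort);
        }
        return String.join(",", parts);
    }

    public static void main(String[] args) {
        String cfg = "# sample\n"
                + "clientPort=2188\n"
                + "server.1=master:4888:5888\n"
                + "server.2=slave1:4888:5888\n"
                + "server.3=slave4:4888:5888\n";
        System.out.println(zkHost(cfg));
    }
}
```

For the configuration in this article the result is master:2188,slave1:2188,slave4:2188, the same value that is passed to the Solr tools and to Tomcat later on.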
- SolrCloud directories
We use /home/hadoop/applications/solr/cloud to hold Solr's library and configuration files; it contains two subdirectories, lib and multicore. A separate directory stores the index: /home/hadoop/applications/storage/cloud/data.
SolrCloud configuration
First configure Solr on one node; we choose the master node.
1. Basic Solr configuration
Unpack the downloaded Solr archive and extract solr-4.3.1\example\webapps\solr.war. Copy the jar files in solr-4.3.1\example\lib\ext into solr-4.3.1\example\webapps\solr\WEB-INF\lib, rename the extracted solr directory to solr-cloud, and upload it to the webapps directory of Tomcat on the server. Then copy the jar files under both solr-4.3.1\example\webapps\solr\WEB-INF\lib and solr-4.3.1\example\lib\ext into the directory /home/hadoop/applications/solr/cloud/lib/:
[hadoop@master ~]$ ls /home/hadoop/applications/solr/cloud/lib/
commons-cli-1.2.jar           lucene-analyzers-common-4.3.1.jar    lucene-suggest-4.3.1.jar
commons-codec-1.7.jar         lucene-analyzers-kuromoji-4.3.1.jar  noggit-0.5.jar
commons-fileupload-1.2.1.jar  lucene-analyzers-phonetic-4.3.1.jar  org.restlet-2.1.1.jar
commons-io-2.1.jar            lucene-codecs-4.3.1.jar              org.restlet.ext.servlet-2.1.1.jar
commons-lang-2.6.jar          lucene-core-4.3.1.jar                slf4j-api-1.6.6.jar
guava-13.0.1.jar              lucene-grouping-4.3.1.jar            slf4j-log4j12-1.6.6.jar
httpclient-4.2.3.jar          lucene-highlighter-4.3.1.jar         solr-core-4.3.1.jar
httpcore-4.2.2.jar            lucene-memory-4.3.1.jar              solr-solrj-4.3.1.jar
httpmime-4.2.3.jar            lucene-misc-4.3.1.jar                spatial4j-0.3.jar
jcl-over-slf4j-1.6.6.jar      lucene-queries-4.3.1.jar             wstx-asl-3.2.7.jar
jul-to-slf4j-1.6.6.jar        lucene-queryparser-4.3.1.jar         zookeeper-3.4.5.jar
log4j-1.2.16.jar              lucene-spatial-4.3.1.jar
The structure of the directory /home/hadoop/applications/solr/cloud/multicore is shown in the figure:
Below, we go through the configuration files in the conf directory above:
- The schema.xml file
<?xml version="1.0" ?>
<schema name="example core two" version="1.1">
  <types>
    <fieldtype name="string" omitNorms="true" />
    <fieldType name="long" />
    <fieldtype name="int" />
    <fieldtype name="float" />
    <fieldType name="date" precisionStep="0" positionIncrementGap="0" />
  </types>
  <fields>
    <field name="id" type="long" indexed="true" stored="true" multiValued="false" required="true" />
    <field name="area" type="string" indexed="true" stored="false" multiValued="false" />
    <field name="building_type" type="int" indexed="true" stored="false" multiValued="false" />
    <field name="category" type="string" indexed="true" stored="false" multiValued="false" />
    <field name="temperature" type="int" indexed="true" stored="false" multiValued="false" />
    <field name="code" type="int" indexed="true" stored="false" multiValued="false" />
    <field name="latitude" type="float" indexed="true" stored="false" multiValued="false" />
    <field name="longitude" type="float" indexed="true" stored="false" multiValued="false" />
    <field name="when" type="date" indexed="true" stored="false" multiValued="false" />
    <field name="_version_" type="long" indexed="true" stored="true" />
  </fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>area</defaultSearchField>
  <solrQueryParser defaultOperator="OR" />
</schema>
- The solrconfig.xml file
<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>LUCENE_43</luceneMatchVersion>
  <directoryFactory name="DirectoryFactory" />
  <dataDir>${solr.shard.data.dir:}</dataDir>
  <schemaFactory />
  <updateHandler>
    <updateLog>
      <str name="dir">${solr.shard.data.dir:}</str>
    </updateLog>
  </updateHandler>
  <!-- realtime get handler, guaranteed to return the latest stored fields of
       any document, without the need to commit or open a new searcher. The
       current implementation relies on the updateLog feature being enabled. -->
  <requestHandler name="/get">
    <lst name="defaults">
      <str name="omitHeader">true</str>
    </lst>
  </requestHandler>
  <requestHandler name="/replication" startup="lazy" />
  <requestDispatcher handleSelect="true">
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" formdataUploadLimitInKB="2048" />
  </requestDispatcher>
  <requestHandler name="standard" default="true" />
  <requestHandler name="/analysis/field" startup="lazy" />
  <requestHandler name="/update" />
  <requestHandler name="/update/csv" startup="lazy">
    <lst name="defaults">
      <str name="separator">,</str>
      <str name="header">true</str>
      <str name="encapsulator">"</str>
    </lst>
    <updateLog>
      <str name="dir">${solr.shard.data.dir:}</str>
    </updateLog>
  </requestHandler>
  <requestHandler name="/admin/" />
  <requestHandler name="/admin/ping">
    <lst name="invariants">
      <str name="q">solrpingquery</str>
    </lst>
    <lst name="defaults">
      <str name="echoParams">all</str>
    </lst>
  </requestHandler>
  <updateRequestProcessorChain name="sample">
    <processor />
    <processor />
    <processor />
  </updateRequestProcessorChain>
  <query>
    <maxBooleanClauses>1024</maxBooleanClauses>
    <filterCache size="10240" initialSize="512" autowarmCount="0" />
    <queryResultCache size="10240" initialSize="512" autowarmCount="0" />
    <documentCache size="10240" initialSize="512" autowarmCount="0" />
    <enableLazyFieldLoading>true</enableLazyFieldLoading>
    <queryResultWindowSize>20</queryResultWindowSize>
    <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
    <maxWarmingSearchers>2</maxWarmingSearchers>
  </query>
  <admin>
    <defaultQuery>solr</defaultQuery>
  </admin>
</config>
- The solrcore.properties file
solr.shard.data.dir=/home/hadoop/applications/storage/cloud/data
The property solr.shard.data.dir is referenced in solrconfig.xml and specifies where the index data is stored.
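The ${name:default} placeholders in solrconfig.xml are filled from properties such as the one defined in solrcore.properties, and the text after the colon serves as the default. The sketch below illustrates that substitution idea in plain Java. It is a simplified illustration written for this article, not Solr's own implementation, and the class name PlaceholderDemo is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified illustration of ${name:default} substitution (not Solr's code).
final class PlaceholderDemo {

    // group(1) = property name, group(2) = default (null when there is no ':').
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([^:}]+)(?::([^}]*))?\\}");

    static String resolve(String text, Map<String, String> props) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = props.get(m.group(1));
            if (value == null) {
                value = m.group(2); // fall back to the default after ':'
            }
            if (value == null) {
                // This sketch chooses to fail when neither a value nor a default exists.
                throw new IllegalArgumentException("no value for " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("solr.shard.data.dir", "/home/hadoop/applications/storage/cloud/data");
        System.out.println(resolve("<dataDir>${solr.shard.data.dir:}</dataDir>", props));
        System.out.println(resolve("${zkClientTimeout:15000}", props));
    }
}
```

With the property set, the dataDir element resolves to the storage path; zkClientTimeout is not defined in the map, so its default of 15000 is used.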
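- The solr.xml file
This file holds the ZooKeeper-related settings as well as the Solr Core configuration:
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores defaultCoreName="collection1" host="${host:}" adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:15000}" hostPort="8888" hostContext="${hostContext:solr-cloud}">
  </cores>
</solr>
Note: we have not configured any core element here. Once installation and configuration are complete, the Collection and its Shards are created through the REST interface that Solr provides, and Solr then updates these configuration files itself.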
2. Managing the configuration files in ZooKeeper
SolrCloud relies on the ZooKeeper cluster to propagate configuration changes to every node promptly, so the configuration files must be uploaded to ZooKeeper:
java -classpath .:/home/hadoop/applications/solr/cloud/lib/* org.apache.solr.cloud.ZkCLI -cmd upconfig -zkhost master:2188,slave1:2188,slave4:2188 -confdir /home/hadoop/applications/solr/cloud/multicore/collection1/conf -confname myconf
java -classpath .:/home/hadoop/applications/solr/cloud/lib/* org.apache.solr.cloud.ZkCLI -cmd linkconfig -collection collection1 -confname myconf -zkhost master:2188,slave1:2188,slave4:2188
After the upload, check what is stored in ZooKeeper:
[hadoop@master ~]$ cd applications/zookeeper/zookeeper-3.4.5/
[hadoop@master zookeeper-3.4.5]$ bin/zkCli.sh -server master:2188
...
[zk: master:2188(CONNECTED) 0] ls /
[configs, collections, zookeeper]
[zk: master:2188(CONNECTED) 2] ls /configs
[myconf]
[zk: master:2188(CONNECTED) 3] ls /configs/myconf
[solrcore.properties, solrconfig.xml, schema.xml]
3. Tomcat configuration and startup
Add the following setting to Tomcat's startup script bin/catalina.sh:
JAVA_OPTS="-server -Xmx4096m -Xms1024m -verbose:gc -Xloggc:solr_gc.log -Dsolr.solr.home=/home/hadoop/applications/solr/cloud/multicore -DzkHost=master:2188,slave1:2188,slave4:2188"
Start the Tomcat server:
cd servers/apache-tomcat-7.0.42
bin/catalina.sh start
Watch the log:
cd servers/apache-tomcat-7.0.42
tail -100f logs/catalina.out
Now look at the state of the data in ZooKeeper:
[hadoop@master apache-tomcat-7.0.42]$ cd ~/applications/zookeeper/zookeeper-3.4.5/
[hadoop@master zookeeper-3.4.5]$ bin/zkCli.sh -server master:2188
...
[zk: master:2188(CONNECTED) 0] ls /
[configs, zookeeper, clusterstate.json, aliases.json, live_nodes, overseer, overseer_elect, collections]
[zk: master:2188(CONNECTED) 1] ls /live_nodes
[10.95.3.61:8888_solr-cloud]
[zk: master:2188(CONNECTED) 2] ls /collections
[collection1]
At this point the SolrCloud cluster has only one live node, and a collection1 instance has been generated by default. This instance is effectively virtual: opening http://master:8888/solr-cloud/ in the web UI shows no SolrCloud information at all, as shown in the figure:
4. Sync data and configuration, then start the other nodes
To install Tomcat and Solr on the other two nodes, it is enough to copy the corresponding directories:
[hadoop@master ~]$ scp -r servers/ hadoop@slave1:~/
[hadoop@master ~]$ scp -r servers/ hadoop@slave4:~/
[hadoop@master ~]$ scp -r applications/solr/cloud hadoop@slave1:~/applications/solr/
[hadoop@master ~]$ scp -r applications/solr/cloud hadoop@slave4:~/applications/solr/
[hadoop@slave1 ~]$ mkdir -p applications/storage/cloud/data/
[hadoop@slave4 ~]$ mkdir -p applications/storage/cloud/data/
Start the other Solr server nodes:
[hadoop@slave1 ~]$ cd servers/apache-tomcat-7.0.42
[hadoop@slave1 apache-tomcat-7.0.42]$ bin/catalina.sh start
[hadoop@slave4 ~]$ cd servers/apache-tomcat-7.0.42
[hadoop@slave4 apache-tomcat-7.0.42]$ bin/catalina.sh start
Check the state of the data in the ZooKeeper cluster:
[zk: master:2188(CONNECTED) 3] ls /live_nodes
[10.95.3.65:8888_solr-cloud, 10.95.3.61:8888_solr-cloud, 10.95.3.62:8888_solr-cloud]
There are now 3 live nodes, but the SolrCloud cluster still has no further information: http://master:8888/solr-cloud/ looks the same as in the figure above, with no SolrCloud data.
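Each entry under /live_nodes above appears to combine a node's host, port, and web context as host:port_context. As a rough sketch under that assumption, the helper below turns such an entry into the base URL used elsewhere in this article. LiveNodeUrl is a hypothetical name, and the sketch assumes an http scheme and a context without special characters.

```java
// Hypothetical helper: maps a live_nodes entry to a base URL.
final class LiveNodeUrl {

    static String baseUrl(String nodeName) {
        // Assumed layout: <host>:<port>_<context>, split at the first underscore.
        int sep = nodeName.indexOf('_');
        if (sep < 0) {
            throw new IllegalArgumentException("unexpected node name: " + nodeName);
        }
        String hostPort = nodeName.substring(0, sep);
        String context = nodeName.substring(sep + 1);
        return "http://" + hostPort + "/" + context;
    }

    public static void main(String[] args) {
        String[] liveNodes = {
            "10.95.3.65:8888_solr-cloud",
            "10.95.3.61:8888_solr-cloud",
            "10.95.3.62:8888_solr-cloud"
        };
        for (String node : liveNodes) {
            System.out.println(baseUrl(node));
        }
    }
}
```

For the master node this yields http://10.95.3.61:8888/solr-cloud, which matches the hostPort and hostContext values set in solr.xml.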
5. Create the Collection, Shards, and replicas
- Create the Collection and its initial Shards
Create the Collection directly through the REST interface:
[hadoop@master ~]$ curl 'http://master:8888/solr-cloud/admin/collections?action=CREATE&name=mycollection&numShards=3&replicationFactor=1'
On success, the response looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4103</int>
  </lst>
  <lst name="success">
    <lst>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">3367</int>
      </lst>
      <str name="core">mycollection_shard2_replica1</str>
      <str name="saved">/home/hadoop/applications/solr/cloud/multicore/solr.xml</str>
    </lst>
    <lst>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">3280</int>
      </lst>
      <str name="core">mycollection_shard1_replica1</str>
      <str name="saved">/home/hadoop/applications/solr/cloud/multicore/solr.xml</str>
    </lst>
    <lst>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">3690</int>
      </lst>
      <str name="core">mycollection_shard3_replica1</str>
      <str name="saved">/home/hadoop/applications/solr/cloud/multicore/solr.xml</str>
    </lst>
  </lst>
</response>
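The parameters in the URL above have the following meaning:
name: the name of the Collection to create
numShards: the number of shards
replicationFactor: the number of copies of each shard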
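The CREATE request is an ordinary URL assembled from the three parameters just described. A minimal sketch of building it programmatically is given below. CreateCollectionUrl is a hypothetical helper written for this article, not a SolrJ class, and it assumes the collection name needs no URL escaping.

```java
// Hypothetical helper: assembles the Collections API CREATE URL used above.
final class CreateCollectionUrl {

    static String build(String baseUrl, String name, int numShards, int replicationFactor) {
        if (numShards < 1 || replicationFactor < 1) {
            throw new IllegalArgumentException("numShards and replicationFactor must be >= 1");
        }
        // Assumes 'name' contains only characters that are safe in a URL.
        return baseUrl + "/admin/collections?action=CREATE"
                + "&name=" + name
                + "&numShards=" + numShards
                + "&replicationFactor=" + replicationFactor;
    }

    public static void main(String[] args) {
        // Reproduces the URL passed to curl in this article.
        System.out.println(build("http://master:8888/solr-cloud", "mycollection", 3, 1));
    }
}
```

The printed URL can be passed to curl (in single quotes, so the shell does not interpret the ampersands) exactly as in the command shown earlier.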
If the operation completes without errors, a Collection named mycollection now exists, with one shard on each node. The state in ZooKeeper can be checked as well:
[zk: master:2188(CONNECTED) 5] ls /collections
[mycollection, collection1]
[zk: master:2188(CONNECTED) 6] ls /collections/mycollection
[leader_elect, leaders]
The web admin page at http://master:8888/solr-cloud/#/~cloud shows the shard layout of the SolrCloud cluster, as in the figure:
The figure shows which Solr shard lives on which node:
shard3 10.95.3.61 master
shard1 10.95.3.62 slave1
shard2 10.95.3.65 slave4
On the master node we can also see that the contents of the Solr configuration file have changed:
[hadoop@master ~]$ cat applications/solr/cloud/multicore/solr.xml
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores defaultCoreName="collection1" host="${host:}" adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:15000}" hostPort="8888" hostContext="${hostContext:solr-cloud}">
    <core loadOnStartup="true" shard="shard3" instanceDir="mycollection_shard3_replica1/" transient="false" name="mycollection_shard3_replica1" collection="mycollection" />
  </cores>
</solr>
- Create replicas
Next, replicate the initial shards that were just created. shard1 is already on slave1; we copy it to master and slave4 by running the following commands:
[hadoop@master ~]$ curl 'http://master:8888/solr-cloud/admin/cores?action=CREATE&collection=mycollection&name=mycollection_shard1_replica_2&shard=shard1'
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1485</int>
  </lst>
  <str name="core">mycollection_shard1_replica_2</str>
  <str name="saved">/home/hadoop/applications/solr/cloud/multicore/solr.xml</str>
</response>
[hadoop@master ~]$ curl 'http://master:8888/solr-cloud/admin/cores?action=CREATE&collection=mycollection&name=mycollection_shard1_replica_3&shard=shard1'
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2543</int>
  </lst>
  <str name="core">mycollection_shard1_replica_3</str>
  <str name="saved">/home/hadoop/applications/solr/cloud/multicore/solr.xml</str>
</response>
[hadoop@slave4 ~]$ curl 'http://slave4:8888/solr-cloud/admin/cores?action=CREATE&collection=mycollection&name=mycollection_shard1_replica_4&shard=shard1'
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2405</int>
  </lst>
  <str name="core">mycollection_shard1_replica_4</str>
  <str name="saved">/home/hadoop/applications/solr/cloud/multicore/solr.xml</str>
</response>
As a result, shard1 on slave1 now has 2 replicas on the master node, named mycollection_shard1_replica_2 and mycollection_shard1_replica_3, and one replica on the slave4 node, named mycollection_shard1_replica_4. The change is also visible in the directories on master and slave4:
[hadoop@master ~]$ ll applications/solr/cloud/multicore/
总用量 24
drwxrwxr-x. 4 hadoop hadoop 4096 8月 1 09:58 collection1
drwxrwxr-x. 3 hadoop hadoop 4096 8月 1 15:41 mycollection_shard1_replica_2
drwxrwxr-x. 3 hadoop hadoop 4096 8月 1 15:42 mycollection_shard1_replica_3
drwxrwxr-x. 3 hadoop hadoop 4096 8月 1 15:23 mycollection_shard3_replica1
-rw-rw-r--. 1 hadoop hadoop 784 8月 1 15:42 solr.xml
-rw-rw-r--. 1 hadoop hadoop 1004 8月 1 10:02 zoo.cfg
[hadoop@slave4 ~]$ ll applications/solr/cloud/multicore/
总用量 20
drwxrwxr-x. 4 hadoop hadoop 4096 8月 1 14:53 collection1
drwxrwxr-x. 3 hadoop hadoop 4096 8月 1 15:44 mycollection_shard1_replica_4
drwxrwxr-x. 3 hadoop hadoop 4096 8月 1 15:23 mycollection_shard2_replica1
-rw-rw-r--. 1 hadoop hadoop 610 8月 1 15:44 solr.xml
-rw-rw-r--. 1 hadoop hadoop 1004 8月 1 15:08 zoo.cfg
Here, mycollection_shard3_replica1 and mycollection_shard2_replica1 are the shards generated automatically when the Collection was created, that is, the first replica of each. The web UI gives a more direct view of shard1, as in the figure:
On the master node, the contents of the Solr configuration file have changed once more:
[hadoop@master ~]$ cat applications/solr/cloud/multicore/solr.xml
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores defaultCoreName="collection1" host="${host:}" adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:15000}" hostPort="8888" hostContext="${hostContext:solr-cloud}">
    <core loadOnStartup="true" shard="shard3" instanceDir="mycollection_shard3_replica1/" transient="false" name="mycollection_shard3_replica1" collection="mycollection" />
    <core loadOnStartup="true" shard="shard1" instanceDir="mycollection_shard1_replica_2/" transient="false" name="mycollection_shard1_replica_2" collection="mycollection" />
    <core loadOnStartup="true" shard="shard1" instanceDir="mycollection_shard1_replica_3/" transient="false" name="mycollection_shard1_replica_3" collection="mycollection" />
  </cores>
</solr>
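At this point we have finished configuring a SolrCloud cluster on 3 physical nodes.
Indexing data
Based on the schema.xml defined earlier, we generated our own data set with the following code: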
package org.shirdrn.solr.data;

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Random;

public class BuildingSampleGenerator {

    private final DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
    private Random random = new Random();

    static String[] areas = {
        "北京", "上海", "深圳", "广州", "天津", "重庆", "成都",
        "银川", "沈阳", "大连", "吉林", "郑州", "徐州", "兰州",
        "东京", "纽约", "贵州", "长春", "大连", "武汉", "南京",
        "海口", "太原", "济南", "日照", "菏泽", "包头", "松原"
    };

    long pre = 0L;
    long current = 0L;

    // Generates an id from System.nanoTime(), waiting until it differs from the previous id.
    public synchronized long genId() {
        current = System.nanoTime();
        while (current == pre) {
            try {
                Thread.sleep(0, 1);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            current = System.nanoTime();
        }
        // Remember the last id on every call so the duplicate check above is effective.
        pre = current;
        return current;
    }

    public String genArea() {
        return areas[random.nextInt(areas.length)];
    }

    private int maxLatitude = 90;
    private int maxLongitude = 180;

    public Coordinate genCoordinate() {
        int beforeDot = random.nextInt(maxLatitude);
        double afterDot = random.nextDouble();
        double lat = beforeDot + afterDot;
        beforeDot = random.nextInt(maxLongitude);
        afterDot = random.nextDouble();
        double lon = beforeDot + afterDot;
        return new Coordinate(lat, lon);
    }

    private Random random1 = new Random(System.currentTimeMillis());
    private Random random2 = new Random(2 * System.currentTimeMillis());

    public int genFloors() {
        return 1 + random1.nextInt(50) + random2.nextInt(50);
    }

    public class Coordinate {
        double latitude;
        double longitude;

        public Coordinate() {
            super();
        }

        public Coordinate(double latitude, double longitude) {
            super();
            this.latitude = latitude;
            this.longitude = longitude;
        }

        public double getLatitude() {
            return latitude;
        }

        public double getLongitude() {
            return longitude;
        }
    }

    static int[] signs = {-1, 1};

    public int genTemperature() {
        return signs[random.nextInt(2)] * random.nextInt(81);
    }

    static String[] codes = {
        "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
        "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"
    };

    public String genCode() {
        return codes[random.nextInt(codes.length)];
    }

    static int[] types = {0, 1, 2, 3};

    public int genBuildingType() {
        return types[random.nextInt(types.length)];
    }

    static String[] categories = {
        "办公建筑", "教育建筑", "商业建筑", "文教建筑", "医卫建筑",
        "住宅", "宿舍", "公寓", "工业建筑"
    };

    public String genBuildingCategory() {
        return categories[random.nextInt(categories.length)];
    }

    // Writes a CSV file with a header row followed by 'count' random records.
    public void generate(String file, int count) throws IOException {
        BufferedWriter w = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));
        w.write("id,area,building_type,category,temperature,code,latitude,longitude,when");
        w.newLine();
        for (int i = 0; i < count; i++) {
            String when = df.format(new Date());
            StringBuffer sb = new StringBuffer();
            sb.append(genId()).append(",")
              .append("\"").append(genArea()).append("\"").append(",")
              .append(genBuildingType()).append(",")
              .append("\"").append(genBuildingCategory()).append("\"").append(",")
              .append(genTemperature()).append(",")
              .append(genCode()).append(",");
            Coordinate coord = genCoordinate();
            sb.append(coord.latitude).append(",")
              .append(coord.longitude).append(",")
              .append("\"").append(when).append("\"");
            w.write(sb.toString());
            w.newLine();
        }
        w.close();
        System.out.println("Finished: file=" + file);
    }

    public static void main(String[] args) throws Exception {
        BuildingSampleGenerator gen = new BuildingSampleGenerator();
        String file = "E:\\Develop\\eclipse-jee-kepler\\workspace\\solr-data\\building_files";
        for (int i = 0; i <= 9; i++) {
            String f = new String(file + "_100w_0" + i + ".csv");
            gen.generate(f, 5000000);
        }
    }
}
The generated files are listed below:
[hadoop@master solr-data]$ ll building_files_100w*
-rw-rw-r--. 1 hadoop hadoop 109025853 7月 26 14:05 building_files_100w_00.csv
-rw-rw-r--. 1 hadoop hadoop 108015504 7月 26 10:53 building_files_100w_01.csv
-rw-rw-r--. 1 hadoop hadoop 108022184 7月 26 11:00 building_files_100w_02.csv
-rw-rw-r--. 1 hadoop hadoop 108016854 7月 26 11:00 building_files_100w_03.csv
-rw-rw-r--. 1 hadoop hadoop 108021750 7月 26 11:00 building_files_100w_04.csv
-rw-rw-r--. 1 hadoop hadoop 108017496 7月 26 11:00 building_files_100w_05.csv
-rw-rw-r--. 1 hadoop hadoop 108016193 7月 26 11:00 building_files_100w_06.csv
-rw-rw-r--. 1 hadoop hadoop 108023537 7月 26 11:00 building_files_100w_07.csv
-rw-rw-r--. 1 hadoop hadoop 108014684 7月 26 11:00 building_files_100w_08.csv
-rw-rw-r--. 1 hadoop hadoop 108022044 7月 26 11:00 building_files_100w_09.csv
The data files have the following format:
[hadoop@master solr-data]$ head building_files_100w_00.csv
id,area,building_type,category,temperature,code,latitude,longitude,when
18332617097417,"广州",2,"医卫建筑",61,N,5.160762478343409,62.92919119315037,"2013-07-26T14:05:55.832Z"
18332617752331,"成都",1,"教育建筑",10,Q,77.34792453477195,72.59812030045762,"2013-07-26T14:05:55.833Z"
18332617815833,"大连",0,"教育建筑",18,T,81.47569061530493,0.2177194388096203,"2013-07-26T14:05:55.833Z"
18332617903711,"广州",0,"办公建筑",31,D,51.85825084513671,13.60710950097155,"2013-07-26T14:05:55.833Z"
18332617958555,"深圳",3,"商业建筑",5,H,22.181374031472675,119.76001810254823,"2013-07-26T14:05:55.833Z"
18332618020454,"济南",3,"公寓",-65,L,84.49607030736806,29.93095171443135,"2013-07-26T14:05:55.834Z"
18332618075939,"北京",2,"住宅",-29,J,86.61660177436184,39.20847527640485,"2013-07-26T14:05:55.834Z"
18332618130141,"菏泽",0,"医卫建筑",24,J,70.57574551258345,121.21977908377244,"2013-07-26T14:05:55.834Z"
18332618184343,"徐州",2,"办公建筑",31,W,0.10129771041097524,153.40533210345387,"2013-07-26T14:05:55.834Z"
Next we index data into the SolrCloud cluster built above. A simple client was implemented for this purpose:
package org.shirdrn.solr.indexing;

import java.io.IOException;
import java.net.MalformedURLException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.shirdrn.solr.data.BuildingSampleGenerator;
import org.shirdrn.solr.data.BuildingSampleGenerator.Coordinate;

public class CloudSolrClient {

    private CloudSolrServer cloudSolrServer;

    // Lazily creates the CloudSolrServer that talks to the cluster through ZooKeeper.
    public synchronized void open(final String zkHost, final String defaultCollection,
            int zkClientTimeout, final int zkConnectTimeout) {
        if (cloudSolrServer == null) {
            try {
                cloudSolrServer = new CloudSolrServer(zkHost);
                cloudSolrServer.setDefaultCollection(defaultCollection);
                cloudSolrServer.setZkClientTimeout(zkClientTimeout);
                cloudSolrServer.setZkConnectTimeout(zkConnectTimeout);
            } catch (MalformedURLException e) {
                System.out.println("The zkHost URL is not correct! It must have the form:\n zkHost:port");
                e.printStackTrace();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public void addDoc(long id, String area, int buildingType, String category,
            int temperature, String code, double latitude, double longitude, String when) {
        try {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("area", area);
            doc.addField("building_type", buildingType);
            doc.addField("category", category);
            doc.addField("temperature", temperature);
            doc.addField("code", code);
            doc.addField("latitude", latitude);
            doc.addField("longitude", longitude);
            doc.addField("when", when);
            cloudSolrServer.add(doc);
            // Committing after every document keeps the example simple but is slow for bulk loads.
            cloudSolrServer.commit();
        } catch (SolrServerException e) {
            System.err.println("Exception while adding document");
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            System.err.println("Unknown exception");
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        final String zkHost = "master:2188";
        final String defaultCollection = "mycollection";
        final int zkClientTimeout = 20000;
        final int zkConnectTimeout = 1000;

        CloudSolrClient client = new CloudSolrClient();
        client.open(zkHost, defaultCollection, zkClientTimeout, zkConnectTimeout);

        BuildingSampleGenerator gen = new BuildingSampleGenerator();
        final DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
        for (int i = 0; i < 10000; i++) {
            long id = gen.genId();
            String area = gen.genArea();
            int buildingType = gen.genBuildingType();
            String category = gen.genBuildingCategory();
            int temperature = gen.genTemperature();
            String code = gen.genCode();
            Coordinate coord = gen.genCoordinate();
            double latitude = coord.getLatitude();
            double longitude = coord.getLongitude();
            String when = df.format(new Date());
            client.addDoc(id, area, buildingType, category, temperature, code, latitude, longitude, when);
        }
    }
}
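Afterwards, the SolrCloud admin page (or a login to the servers themselves) shows how the indexed data is split: it is spread fairly evenly across the shards. The admin page can also be used to manage the replicas of each shard. For example, if a shard has too many replicas, one of them can be removed (unloaded) from the page. This deletes the replica's metadata from the state maintained in ZooKeeper, but the replica's data on the node itself is not deleted; it simply goes offline and no longer serves requests.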
Searching data
We can now run a search, for example with the following query:
http://master:8888/solr-cloud/mycollection/select?q=北京 纽约&fl=*&fq=category:公寓&fq=building_type:2&start=0&rows=10
The search result is shown below:
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">570</int>
  </lst>
  <result name="response" numFound="201568" start="0" maxScore="1.5322487">
    <doc>
      <long name="id">37109751480918</long>
      <long name="_version_">1442164237143113728</long>
    </doc>
    <doc>
      <long name="id">37126929150371</long>
      <long name="_version_">1442164255154503680</long>
    </doc>
    <doc>
      <long name="id">37445266827945</long>
      <long name="_version_">1442164588949798912</long>
    </doc>
    <doc>
      <long name="id">37611390043867</long>
      <long name="_version_">1442164763138195456</long>
    </doc>
    <doc>
      <long name="id">37892268870281</long>
      <long name="_version_">1442165057653833728</long>
    </doc>
    <doc>
      <long name="id">89820941817153</long>
      <long name="_version_">1442219517734289408</long>
    </doc>
    <doc>
      <long name="id">89825667635450</long>
      <long name="_version_">1442219522665742336</long>
    </doc>
    <doc>
      <long name="id">89830029550692</long>
      <long name="_version_">1442219527207124993</long>
    </doc>
    <doc>
      <long name="id">93932235463589</long>
      <long name="_version_">1442223828610580480</long>
    </doc>
    <doc>
      <long name="id">93938975733467</long>
      <long name="_version_">1442223835684274177</long>
    </doc>
  </result>
</response>
The corresponding log entries look like this:
2013-08-05 18:38:26.814 [http-bio-8888-exec-228] INFO org.apache.solr.core.SolrCore – [mycollection_shard1_0_replica2] webapp=/solr-cloud path=/select params={NOW=1375699145633&shard.url=10.95.3.62:8888/solr-cloud/mycollection_shard1_0_replica1/|10.95.3.61:8888/solr-cloud/mycollection_shard1_0_replica3/&fl=id,score&start=0&q=北京+纽约&distrib=false&wt=javabin&isShard=true&fsv=true&fq=category:公寓&fq=building_type:2&version=2&rows=10} hits=41529 status=0 QTime=102
2013-08-05 18:39:06.203 [http-bio-8888-exec-507] INFO org.apache.solr.core.SolrCore – [mycollection_shard3_replica1] webapp=/solr-cloud path=/select params={fl=*&start=0&q=北京+纽约&fq=category:公寓&fq=building_type:2&rows=10} hits=201568 status=0 QTime=570
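The search URL in this section contains a space and non-ASCII characters. A browser encodes these automatically, but a program that sends the request usually has to encode the parameter values itself. Below is a minimal sketch of that step using the JDK's URLEncoder; the class SelectUrl and its methods are hypothetical names chosen for this illustration, and the query words are written as Unicode escapes only to keep the source ASCII.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds an encoded /select URL for a collection.
final class SelectUrl {

    /** URL-encodes one parameter value as UTF-8 (a space becomes '+'). */
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is required on every JVM
        }
    }

    static String build(String baseUrl, String collection, String q, String... filters) {
        StringBuilder sb = new StringBuilder(baseUrl)
                .append('/').append(collection)
                .append("/select?q=").append(enc(q));
        for (String fq : filters) {
            sb.append("&fq=").append(enc(fq)); // one fq parameter per filter
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u5317\u4eac and \u7ebd\u7ea6 are the two query words used in this article.
        String q = "\u5317\u4eac \u7ebd\u7ea6";
        System.out.println(build("http://master:8888/solr-cloud", "mycollection", q, "building_type:2"));
    }
}
```

The space between the two query words is encoded as a plus sign, which is consistent with the q=...+... form that appears in the log entries above.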
Related issues
1. When I created a Collection while 4 nodes were registered in the ZooKeeper cluster, I ran the following command:
[hadoop@slave1 multicore]$ curl 'http://slave1:8888/solr-cloud/admin/collections?action=CREATE&name=tinycollection&numShards=2&replicationFactor=3'
It failed with an exception:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">400</int>
    <int name="QTime">81</int>
  </lst>
  <str name="Operation createcollection caused exception:">org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Cannot create collection tinycollection. Value of maxShardsPerNode is 1, and the number of live nodes is 4. This allows a maximum of 4 to be created. Value of numShards is 2 and value of replicationFactor is 3. This requires 6 shards to be created (higher than the allowed number)</str>
  <lst name="exception">
    <str name="msg">Cannot create collection tinycollection. Value of maxShardsPerNode is 1, and the number of live nodes is 4. This allows a maximum of 4 to be created. Value of numShards is 2 and value of replicationFactor is 3. This requires 6 shards to be created (higher than the allowed number)</str>
    <int name="rspCode">400</int>
  </lst>
  <lst name="error">
    <str name="msg">Cannot create collection tinycollection. Value of maxShardsPerNode is 1, and the number of live nodes is 4. This allows a maximum of 4 to be created. Value of numShards is 2 and value of replicationFactor is 3. This requires 6 shards to be created (higher than the allowed number)</str>
    <int name="code">400</int>
  </lst>
</response>
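The exception shows that 4 nodes are live, but the Collection was requested with 2 shards and a replication factor of 3, which requires 2 × 3 = 6 cores. With maxShardsPerNode at 1, that means at least 6 nodes. Reducing the replication factor, for example to replicationFactor=2 (two copies in total: the leader shard and one other replica), brings the requirement down to 4, and the CREATE operation then succeeds.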
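The rule stated in the error message can be checked before sending the request. The sketch below simply restates that arithmetic; ShardCapacity is a hypothetical helper written for this article, not part of Solr.

```java
// Hypothetical helper: restates the capacity rule from the error message.
final class ShardCapacity {

    /** Cores needed: every shard is stored replicationFactor times. */
    static int required(int numShards, int replicationFactor) {
        return numShards * replicationFactor;
    }

    /** Cores the cluster accepts: each live node may host maxShardsPerNode cores. */
    static int allowed(int liveNodes, int maxShardsPerNode) {
        return liveNodes * maxShardsPerNode;
    }

    static boolean fits(int numShards, int replicationFactor, int liveNodes, int maxShardsPerNode) {
        return required(numShards, replicationFactor) <= allowed(liveNodes, maxShardsPerNode);
    }

    public static void main(String[] args) {
        // The failing request: 2 shards x 3 copies on 4 nodes, 1 core per node.
        System.out.println(fits(2, 3, 4, 1));
        // The adjusted request with replicationFactor=2.
        System.out.println(fits(2, 2, 4, 1));
    }
}
```

The first call reports that the request does not fit (6 required, 4 allowed) and the second that it does. Since the message names maxShardsPerNode as the limiting value, allowing more than one core per node would be another way to satisfy the rule, assuming the Solr version in use accepts that parameter on the CREATE call.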
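This article is published under the Attribution-NonCommercial-ShareAlike 4.0 license. You are welcome to repost, use, and republish it, but you must keep the attribution to the author, 时延军 (including the link: http://shiyanjun.cn), must not use it for commercial purposes, and must release any work derived from this article under the same license. If you have any questions, please contact me.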